Hadoop for System Administrators – Ohio Linux Fest 2014
Justin Miller 
Senior Systems Engineer/DevOps at iHealth Technologies 
Weston Bassler 
Systems Engineer at Verizon Wireless
Hadoop for System Administrators – Ohio Linux Fest 2014 
What we will be covering: 
Intro 
Why Hadoop? 
How Hadoop Works 
Architecture 
Planning Hardware/Storage/Network 
Processing and Storage 
HDFS Components 
YARN Components 
Operations 
Job scheduling 
Job alerts 
Monitoring 
Core Services 
Job scheduler and SLA 
Hardware 
High Availability 
YARN 
HDFS 
Oozie 
Security 
Security Issues 
Authentication 
Authorization 
Encryption 
Backup and Recovery 
What to plan for? 
How to combat 
Hadoop Vendors/Distros 
Cloudera 
HortonWorks 
MapR
Why Hadoop?
Why Hadoop? Cont... 
Sort through TB, even PB worth of data in a matter of minutes 
Easily sift through LOGS (patterns, data mining) → switch logs, application logs 
Batch Processing 
History → Inspired by 2 Google Papers on MapReduce and GoogleFS 
Implemented By Yahoo!
Who's using it?
How Hadoop? 
Processing 
• MapReduce (MRv1) 
What is MapReduce? 
Nobody likes it 
• YARN (MRv2) 
Yet Another Resource Negotiator 
Newer, better, more versatile 
2 New Roles → ResourceManager and ApplicationMaster 
Spark → New Hotness 
• Bringing Processing and Storage together 
Data locality → avoid network! 
“MO NODES MO BETTA”
YARN in Action
Storage 
• HDFS 
What is HDFS? 
Why HDFS? 
• Components of HDFS 
NameNode 
Metadata → fsimage + edits (edit log) 
ZooKeeper → HA management 
Quorum based journaling 
3 JournalNodes 
Active/Passive NameNode 
DataNodes – what do they do? 
Blocks in relation to NameNode Metadata 
Block storage
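A rough sketch of how file size relates to block count and raw storage, since every block is an entry in the NameNode metadata. The 128 MB block size is an assumed illustrative value, not one taken from the slides; the replication factor of 3 matches the planning slide. The function names are hypothetical.

```shell
# Sketch: how many HDFS blocks a file occupies, and its raw footprint
# once every block is replicated. All sizes are in MB.
# ASSUMPTION: 128 MB block size is illustrative; check your own cluster config.

blocks_needed() {
    local file_mb=$1 block_mb=${2:-128}
    # Ceiling division: a partial final block still needs its own
    # entry in the NameNode metadata.
    echo $(( (file_mb + block_mb - 1) / block_mb ))
}

raw_storage_mb() {
    local file_mb=$1 replication=${2:-3}
    # A partial block only stores its actual bytes, so raw usage is
    # file size times replication, not block count times block size.
    echo $(( file_mb * replication ))
}

blocks_needed 1000     # 8 blocks at 128 MB
raw_storage_mb 1000    # 3000 MB spread across the DataNodes
```

Many small files cost the NameNode as much metadata as a few large ones, which is one reason to watch block counts and not only bytes.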
HDFS Write Path
Benefits and Limitations of HDFS 
Benefits 
Low cost per byte → commodity storage 
High Bandwidth/Scales effectively → “Mo nodes Mo speed” 
Rock solid data reliability 
Supports distributed computing I/O patterns 
OPEN SOURCE!!!!!
Benefits and Limitations of HDFS (Continued...) 
Limitations 
Updates → data is immutable (can't be updated, only appended) 
Write Once 
Optimized for sequential reads → not for real-time data processing 
Challenging import/export → requires additional tooling
Architecture 
• Planning your Hardware/Storage 
Cheap disks 
Distributed disk approach → replication factor of 3 for HA 
NO LVM and NO RAID and NO swap 
noatime, nodiratime 
• Network considerations 
Rack awareness affects data distribution 
Prefer a faster network when available → 10 GbE if possible
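One way to verify the noatime recommendation on a running node is to read the mount table. This is a minimal sketch: the function names and the /data/1 mount point are hypothetical, and it assumes a Linux-style mounts file (/proc/mounts by default). On Linux, noatime already implies nodiratime, so checking for noatime is enough.

```shell
# Sketch: report whether a data disk is mounted with noatime.
# ASSUMPTION: /data/1 below is a hypothetical DataNode mount point.

has_option() {
    # $1 = comma-separated mount options, $2 = option to look for
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

mount_options() {
    # Print the options field (4th column) for mount point $1.
    local mount_point=$1 mounts_file=${2:-/proc/mounts}
    awk -v mp="$mount_point" \
        '$2 == mp { opts = $4 } END { if (opts != "") print opts }' \
        "$mounts_file"
}

check_noatime() {
    local opts
    opts=$(mount_options "$1" "${2:-/proc/mounts}")
    if has_option "$opts" noatime; then
        echo "$1: noatime set"
    else
        echo "$1: noatime missing"
    fi
}

# Example (on a Linux host): check_noatime /data/1
```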
Hadoop Operations 
• Jobs 
What is a job? 
Scheduling jobs with Oozie 
Alerts on Jobs 
Oozie SLAs → Start time, end time & duration 
File driven Job Configuration
Example of a Job: 
Example of a coordinator:
Troubleshooting 
• Application → Debug Code
• Job → Debug Execution
• Service → Debug Linux Process (/var/log/hadoop-*) 
Services won't start → port conflicts (nmap, netstat, lsof) 
# problem is a hypothetical variable: application, job, or service 
if [ "$problem" != application ] && [ "$problem" != job ]; then 
    grep -r ERROR /var/log/hadoop-* 
fi
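A small sketch that narrows the log search down to the noisiest files first. The function name is hypothetical, and the directory you pass is whichever /var/log/hadoop-* directory applies on your distribution. It assumes log paths contain no colons.

```shell
# Sketch: rank log files under a directory by number of ERROR lines.

errors_per_log() {
    local log_dir=$1
    # grep -rc prints "file:count" for every file; drop the files with
    # zero matches and sort the rest with the highest count first.
    grep -rc ERROR "$log_dir" 2>/dev/null \
        | grep -v ':0$' \
        | sort -t: -k2,2nr
}

# Example: errors_per_log /var/log/hadoop-hdfs
```

Starting with the file that logs the most errors usually points at the failing service faster than reading one concatenated stream.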
Monitoring 
• Core Services 
HDFS 
YARN 
JMX → JVM Monitoring 
Cloudera Manager 
• Performance 
Ganglia (HortonWorks) 
Cloudera Manager 
• Hardware → to each his own (traditional monitoring) 
SNMP 
Nagios 
Zenoss 
Cloudera Manager
High Availability 
• HDFS 
ZooKeeper (failover management) + quorum based journaling (JournalNodes) 
• YARN 
ZooKeeper
• Oozie HA
Security (Because people are evil)
Security Continued.... 
• Known issues – Stupid/Lazy People 
Hadoop can be very secure 
• Authentication - Kerberos 
Principal (user) 
Realm (group of principals) 
Keytab file 
• Authorization 
LDAP 
Active Directory 
Role based 
• Encryption – For your eyes Only! 
Kerberos 1st 
SSL Certificates 
**** SSL must be enabled for all core Hadoop services
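To make the principal, realm, and keytab terms concrete, here is a minimal sketch of how a service principal name is put together. The host name, realm, and keytab path are placeholders, and the helper function is hypothetical; the kinit and klist lines assume the MIT Kerberos client tools are installed.

```shell
# Sketch: compose a Kerberos service principal name.
# Format: primary/instance@REALM
# ASSUMPTION: nn1.example.com and EXAMPLE.COM are placeholder values.

service_principal() {
    local service=$1 host=$2 realm=$3
    printf '%s/%s@%s\n' "$service" "$host" "$realm"
}

principal=$(service_principal hdfs nn1.example.com EXAMPLE.COM)
echo "$principal"    # hdfs/nn1.example.com@EXAMPLE.COM

# With a keytab for that principal, a non-interactive login might look like:
#   kinit -kt /path/to/hdfs.keytab "$principal"
#   klist
```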
Backup and Recovery – When things go wrong (And they will) 
What can go wrong? What to plan for? 
Data Corruption 
Node crashes 
Disk crashes 
Ways to combat when things do go wrong 
• Data Corruption 
Block checksum verification fails → NameNode re-replicates the block from a healthy replica 
HDFS → hdfs fsck tool 
• Node crashes/Disk crashes 
HDFS saves the day! 
NameNode HA 
First 2 replicas of data on different hosts 
Heartbeat detection
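The fsck check can be wired into a cron job or monitoring probe. A minimal sketch, assuming the fsck summary line reports the filesystem as either HEALTHY or CORRUPT; the helper name is hypothetical, and it reads the report on stdin so it can be tried without a cluster.

```shell
# Sketch: decide from fsck output whether HDFS reports itself healthy.

fsck_is_healthy() {
    # Reads an fsck report on stdin; succeeds only if the summary
    # line says the filesystem is HEALTHY.
    grep -q 'is HEALTHY'
}

# On a node with the hdfs client installed:
#   hdfs fsck / | fsck_is_healthy || echo "HDFS needs attention"
```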
Hadoop Wars - Vendors and Distributions 
• Cloudera 
Specializes in Enterprise tools 
Auditing 
Access Control 
Cluster Management (Cloudera Manager) 
• HortonWorks 
Specializes in Engineering 
Also Open Source 
Top new cool things 
• MapR 
Lead developers behind Mahout
Hopefully you enjoyed! 
Slide Share Link: http://www.slideshare.net/mageru/hadoop-for-sysadmin 
If interested: 
Quick Ways to get started Learning Hadoop 
• Free Stuff – Who doesn't like free? 
Big Data University – Hadoop fundamentals, Pig, Oozie, lots more 
Udacity – Intro to Hadoop and MapReduce 
MapR, Cloudera, HortonWorks – Training Videos
