Hadoop, Taming Elephants

Ovidiu Dimulescu, Security Engineering, Oracle Cloud at Oracle
Hadoop, Taming elephants
      JaxLUG, 2013

      Ovidiu Dimulescu
About @odimulescu
• Working on the Web since 1997
• Into startup and engineering cultures
• Speaker at user groups, code camps
• Founder and organizer for JaxMUG.com
• Organizer for Jax Big Data meetup
Agenda
  •   Background
  •   Architecture v1.0 & 2.0
  •   Ecosystem
  •   Installation
  •   Security
  •   Monitoring
  •   Demo
  •   Q & A
What is Hadoop?
• Apache Hadoop is an open source Java software
  framework for running data-intensive applications on
  large clusters of commodity hardware

• Created by Doug Cutting (Lucene & Nutch creator)

• Named after Doug’s son’s toy elephant
What is it solving, and how?
• Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simpler programming model



Why does it matter?
• Volume, Velocity, Variety and Value

• Datasets do not fit on local HDDs let alone RAM

• Scaling up

   ‣ Is expensive (licensing, hardware, etc.)
   ‣ Has a ceiling (physical, technical, etc.)
Why does it matter?

Data types: Complex 80%, Structured 20%

Complex Data
   • Images, Video
   • Logs
   • Documents
   • Call records
   • Sensor data
   • Mail archives

Structured Data
   • User Profiles
   • CRM
   • HR Records

* Chart Source: IDC White Paper
Why does it matter?

• Scanning 10TB at sustained transfer of 75MB/s takes

   ~2 days on 1 node

   ~5 hrs on a 10-node cluster

• Low $/TB for commodity drives

• Low-end servers are multicore capable
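
The scan-time figures above come from simple arithmetic. The following is a rough sketch in Python, assuming decimal units (1 TB = 1,000,000 MB), data spread evenly across nodes, and no coordination overhead. The function name `scan_hours` is illustrative and not from the talk.

```python
def scan_hours(dataset_tb: float, mb_per_sec: float, nodes: int = 1) -> float:
    """Ideal time in hours to scan a dataset split evenly across nodes."""
    total_mb = dataset_tb * 1_000_000          # decimal units: 1 TB = 10^6 MB
    seconds = total_mb / (mb_per_sec * nodes)  # every node scans its share in parallel
    return seconds / 3600


if __name__ == "__main__":
    for n in (1, 10):
        print(f"{n:>2} node(s): {scan_hours(10, 75, n):.1f} h")
```

Under these idealized assumptions the result is roughly 37 hours on one node and under 4 hours on ten. The rounder figures on the slide presumably leave headroom for real-world overhead.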
Use cases

• ETL - Extract Transform Load

• Pattern Recognition

• Recommendation Engines

• Prediction Models

• Log Processing

• Data “sandbox”
Who uses it?
Who supports it?
What is Hadoop not?

• Not a database replacement

• Not a data warehouse (complements it)

• Not for interactive reporting

• Not a general purpose storage mechanism

• Not for problems that cannot be parallelized in a
  shared-nothing fashion
Architecture – Core Components

HDFS

Distributed filesystem designed for low cost storage
and high bandwidth access across the cluster.


Map-Reduce

Programming model for processing and generating
large data sets.
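
To make the Map-Reduce programming model concrete, here is a minimal single-process sketch in Python that counts log lines per level. It only imitates the three phases (map, shuffle, reduce). The names `mapper`, `reducer` and `run_job` are illustrative and are not Hadoop APIs, and the log format is an assumption.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def mapper(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit (log_level, 1), taking the first token as the level."""
    fields = line.split()
    if fields:
        yield fields[0], 1


def reducer(key: str, values: Iterable[int]) -> tuple[str, int]:
    """Reduce step: sum all counts collected for one key."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> dict[str, int]:
    """Run map, shuffle (group values by key), and reduce in one process."""
    groups: dict[str, list[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: gather values per key
    return dict(reducer(k, v) for k, v in sorted(groups.items()))


if __name__ == "__main__":
    print(run_job(["INFO started", "ERROR disk failed", "INFO finished"]))
```

In a real cluster the map and reduce steps run on different machines and the shuffle moves data over the network; the sketch keeps only the data flow.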
HDFS - Design

•   Files are stored as blocks (64MB default size)

•   Configurable data replication (3x, Rack Aware*)

•   Fault Tolerant, Expects HW failures

•   HUGE files, Expects Streaming not Low Latency

•   Mostly WORM

•   Not POSIX compliant

•   Not mountable OOTB*
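
As a quick illustration of what the block size and replication defaults imply, the sketch below estimates how many blocks a file occupies and how much raw disk it consumes. It assumes the 64MB block size and 3x replication quoted on the slide; the function names are illustrative.

```python
import math

BLOCK_SIZE_MB = 64  # default block size quoted on the slide
REPLICATION = 3     # default replication factor quoted on the slide


def block_count(file_mb: float, block_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_mb / block_mb)


def raw_storage_mb(file_mb: float, replication: int = REPLICATION) -> float:
    """Raw disk consumed across the cluster once every block is replicated."""
    return file_mb * replication


if __name__ == "__main__":
    size_mb = 1024  # a 1 GB file
    print(block_count(size_mb), "blocks,", raw_storage_mb(size_mb), "MB raw")
```

A 1 GB file therefore maps to 16 blocks and about 3 GB of raw disk, which is worth keeping in mind when sizing a cluster.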
HDFS - Architecture


Namenode (NN)
Datanode 1, Datanode 2, ... Datanode N

1. Client asks the NN for a file
2. NN returns the DNs that host it
3. Client asks a DN for the data

Namenode - Master
•     Filesystem metadata
•     Controls reads/writes to files
•     Manages block replication

Datanode - Slaves
•     Reads / writes blocks from / to clients
•     Replicates blocks at the master’s request
•     Notifies the master about block IDs

Single Namespace
Single Block Pool
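
The read path on this slide can be sketched as a toy simulation. This is not the HDFS client API: the classes `NameNode` and `DataNode` and the function `read_file` are hypothetical stand-ins that show only the division of labor, with metadata on the Namenode and block contents on the Datanodes.

```python
class DataNode:
    """Stores block contents keyed by block id."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: dict[str, bytes] = {}

    def read_block(self, block_id: str) -> bytes:
        return self.blocks[block_id]


class NameNode:
    """Holds metadata only: the blocks of each file and where replicas live."""

    def __init__(self) -> None:
        self.files: dict[str, list[tuple[str, list[str]]]] = {}

    def locate(self, path: str) -> list[tuple[str, list[str]]]:
        return self.files[path]


def read_file(path: str, namenode: NameNode, live_nodes: dict[str, DataNode]) -> bytes:
    """Ask the NameNode for block locations, then fetch each block from a DataNode."""
    data = b""
    for block_id, hosts in namenode.locate(path):
        for host in hosts:  # try replicas in order until one is reachable
            node = live_nodes.get(host)
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return data


if __name__ == "__main__":
    nn, dn1, dn2 = NameNode(), DataNode("dn1"), DataNode("dn2")
    dn1.blocks["blk_1"] = dn2.blocks["blk_1"] = b"hello "
    dn2.blocks["blk_2"] = b"world"
    nn.files["/demo.txt"] = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2"])]
    print(read_file("/demo.txt", nn, {"dn1": dn1, "dn2": dn2}))
```

Note that file data never passes through the Namenode in this sketch, which mirrors the slide: the client talks to the Datanodes directly once it knows where the blocks are.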
HDFS - Fault tolerance

•   DataNode

         Uses CRC32 checksums to detect corruption
         Data is replicated on other nodes (3x)*

•   NameNode

         fsimage - last snapshot
         edits - log of changes since last snapshot
         Checkpoint Node
         Backup NameNode
         Failover is manual*
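
The checksum idea behind the DataNode bullet can be illustrated with Python's standard `zlib.crc32`. This is a sketch of the technique only: HDFS checksums small chunks of each block rather than a whole payload, and the helper names `store_block` and `verify_block` are hypothetical.

```python
import zlib


def store_block(data: bytes) -> tuple[bytes, int]:
    """Return the data together with its CRC32, as if writing a block."""
    return data, zlib.crc32(data)


def verify_block(data: bytes, expected_crc: int) -> bool:
    """Recompute the CRC32 on read and compare with the stored value."""
    return zlib.crc32(data) == expected_crc


if __name__ == "__main__":
    block, crc = store_block(b"block contents")
    print("intact:", verify_block(block, crc))
    print("corrupted:", verify_block(b"block c0ntents", crc))
```

A checksum only detects the damage. Recovery relies on the replicas: a reader that hits a bad copy can fetch the block from another node.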
MapReduce - Architecture

JobTracker (JT)
TaskTracker 1, TaskTracker 2, ... TaskTracker N

Client launches a job through the Jobs API
  - Configuration
  - Mapper
  - Reducer
  - Input
  - Output

JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution

TaskTracker - Slaves
• Runs Map and Reduce tasks received from the JobTracker
• Manages storage and transmission of intermediate output
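
The "re-executes tasks upon failure" behavior can be illustrated with a toy scheduler. This is a deliberately simplified sketch, not how the JobTracker is implemented: there are no heartbeats, data locality, or speculative execution, and `run_with_retries` and the tracker functions are hypothetical.

```python
from itertools import cycle
from typing import Callable

Tracker = tuple[str, Callable[[str], str]]


def healthy_tracker(payload: str) -> str:
    """Pretend to run a task successfully."""
    return payload.upper()


def broken_tracker(payload: str) -> str:
    """Pretend the node died while running the task."""
    raise RuntimeError("simulated TaskTracker failure")


def run_with_retries(tasks: dict[str, str], trackers: list[Tracker],
                     max_attempts: int = 3) -> dict[str, tuple[str, str]]:
    """Assign tasks round-robin; after a failure, retry on the next tracker."""
    results: dict[str, tuple[str, str]] = {}
    rotation = cycle(trackers)
    for task_id, payload in tasks.items():
        for _ in range(max_attempts):
            name, run = next(rotation)
            try:
                results[task_id] = (name, run(payload))
                break
            except RuntimeError:
                continue  # re-execute the task elsewhere
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results


if __name__ == "__main__":
    trackers = [("tt1", broken_tracker), ("tt2", healthy_tracker)]
    print(run_with_retries({"m1": "a", "m2": "b"}, trackers))
```

Re-execution is safe in this model because each task is a pure function of its input split, so running it twice yields the same output.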
Hadoop - Core Architecture


Jobs API:  JobTracker
           TaskTracker 1, TaskTracker 2, ... TaskTracker N
HDFS:      NameNode
           DataNode 1, DataNode 2, ... DataNode N

[Diagram: each node pairs a TaskTracker with a DataNode]

* Mini OS: Filesystem & Scheduler
Hadoop 2.0 - HDFS Architecture




• Distributed Namespace
• Multiple Block Pools
Hadoop 2.0 - YARN Architecture
MapReduce - Clients

Java - Native
 hadoop jar jar_path main_class input_path output_path


C++ - Pipes framework
 hadoop pipes -input path_in -output path_out -program exec_program


Any – Streaming
 hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out


Pig Latin, Hive HQL, C via JNI
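
For the Streaming option, the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout. Below is a minimal word-count sketch in Python. The script name `wc.py` and its `map` / `reduce` arguments are assumptions made for this example. The reducer relies on its input arriving sorted by key, which the framework provides.

```python
import sys
from itertools import groupby
from typing import Iterable, Iterator


def map_stream(lines: Iterable[str]) -> Iterator[str]:
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_stream(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Reducer: sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Runs only when invoked as: python wc.py map   or   python wc.py reduce
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = map_stream if sys.argv[1] == "map" else reduce_stream
    sys.stdout.writelines(out + "\n" for out in step(sys.stdin))
```

The same pipeline can be tried locally before submitting it, for example with `cat input.txt | python wc.py map | sort | python wc.py reduce`. On a cluster the script would be passed as `-mapper "python wc.py map" -reducer "python wc.py reduce"`, and shipped to the nodes with the streaming `-file` option.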
Hadoop - Ecosystem

Management
   ZooKeeper, Chukwa, Ambari, HUE

Data Access
   Pig, Hive, Flume, Impala, Sqoop

Data Processing
   MapReduce, Giraph, Hama, Mahout, MPI

Storage
   HDFS, HBase
Installation - Platforms

Production
    Linux – Official

Development
   Linux
   OSX
   Windows via Cygwin
   *Nix
Installation - Versions

Public Numbering

 1.0.x - current stable version
 1.1.x - current beta version for 1.x branch
 2.x - current alpha version

Development Numbering

 0.20.x aka 1.x - CDH 3 & HDP 1
 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
Installation - For toying

Option 1 - Official project releases
     hadoop.apache.org/common/releases.html

Option 2 - Demo VM from vendors
     •   Cloudera
     •   Hortonworks
     •   Greenplum
     •   MapR

Option 3 - Cloud
     • Amazon’s EMR
     • Hadoop on Azure
Installation - For real

Vendor distributions
   •   Cloudera CDH
   •   Hortonworks HDP
   •   Greenplum GPHD
   •   MapR M3, M5 or M7

Hosted solutions

   •   AWS EMR
   •   Hadoop on Azure

Use Virtualization - VMware Serengeti *
Security - Simple Mode

• Use in a trusted environment
  ‣   Identity comes from euid of the client process
  ‣   MapReduce tasks run as the TaskTracker user
  ‣   User that starts the NameNode is super-user

• Reasonable protection against accidental misuse
• Simple to set up
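
To illustrate why Simple Mode suits only trusted environments, the sketch below shows how a client-side identity can be derived from the effective uid of the running process. It assumes a Unix-like system and uses Python's standard `os` and `pwd` modules; it is not Hadoop's own code, and `simple_mode_identity` is a hypothetical name.

```python
import os
import pwd


def simple_mode_identity() -> str:
    """Return the user name the OS reports for this process's effective uid."""
    euid = os.geteuid()
    try:
        return pwd.getpwuid(euid).pw_name
    except KeyError:
        # No passwd entry for this uid (common in containers): fall back to the number.
        return str(euid)


if __name__ == "__main__":
    print("client would be identified as:", simple_mode_identity())
```

The identity is whatever the client machine reports, with no proof attached. Anyone who controls a client host can therefore present any user name, which Secure Mode addresses with Kerberos.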
Security - Secure Mode

• Kerberos based
• Use for tight, granular access control
    ‣   Identity comes from Kerberos Principal
    ‣   MapReduce tasks run as Kerberos Principal

•   Use a dedicated MIT KDC

•   Hook it to your primary KDC (AD, etc.)

•   Significant setup effort (users, groups and Kerberos keys
    on all nodes, etc.)
Monitoring

Built-in

  • JMX
  • REST
  • No SNMP support
Other

  Cloudera Manager (Free up to 50 nodes)
  Ambari - Free, RPM based systems (RH, CentOS)
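
As a sketch of how the built-in JMX data can be consumed over HTTP, the example below parses a JSON document shaped like the output of the `/jmx` servlet that Hadoop daemons expose on their web ports. The sample payload, its numbers and the helper names are assumptions for illustration; check bean and attribute names against a running cluster before relying on them.

```python
import json

# Hypothetical, trimmed-down payload; the values are made up for this example.
SAMPLE_PAYLOAD = """
{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystemState",
   "CapacityTotal": 1000, "CapacityUsed": 250}
]}
"""


def find_bean(payload: str, name: str) -> dict:
    """Return the first bean whose 'name' matches, or raise KeyError."""
    for bean in json.loads(payload)["beans"]:
        if bean.get("name") == name:
            return bean
    raise KeyError(name)


def used_fraction(bean: dict) -> float:
    """Fraction of total capacity currently in use."""
    return bean["CapacityUsed"] / bean["CapacityTotal"]


if __name__ == "__main__":
    bean = find_bean(SAMPLE_PAYLOAD, "Hadoop:service=NameNode,name=FSNamesystemState")
    print(f"HDFS capacity used: {used_fraction(bean):.0%}")
```

A monitoring script built this way can feed an existing alerting system, which partly compensates for the missing SNMP support.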
Demo
Questions?
References
Hadoop Operations, by Eric Sammer
Hadoop Security, by Hortonworks Blog

HDFS Federation, by Suresh Srinivas

Hadoop 2.0 New Features, by VertiCloud Inc

MapReduce in Simple Terms, by Saliya Ekanayake

Hadoop Architecture, by Phillipe Julio