Hadoop, Taming elephants
      JaxLUG, 2013

      Ovidiu Dimulescu
About @odimulescu
• Working on the Web since 1997
• Into startup and engineering cultures
• Speaker at user groups, code camps
• Founder and organizer for JaxMUG.com
• Organizer for Jax Big Data meetup
Agenda
  •   Background
  •   Architecture v1.0 & 2.0
  •   Ecosystem
  •   Installation
  •   Security
  •   Monitoring
  •   Demo
  •   Q & A
What is Hadoop?
• Apache Hadoop is an open source Java software
  framework for running data-intensive applications on
  large clusters of commodity hardware

• Created by Doug Cutting (Lucene & Nutch creator)

• Named after Doug’s son’s toy elephant
What is it solving, and how?
• Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simpler programming model



[Diagram: data distributed across many CPU nodes, with computation running locally on each node]
Why does it matter?
• Volume, Velocity, Variety and Value

• Datasets do not fit on local HDDs let alone RAM

• Scaling up

   ‣ Is expensive (licensing, hardware, etc.)
   ‣ Has a ceiling (physical, technical, etc.)
Why does it matter?

           Data types *

           Complex Data (~80%): Images, Video, Logs, Documents,
           Call records, Sensor data, Mail archives

           Structured Data (~20%): User Profiles, CRM, HR Records

* Chart Source: IDC White Paper
Why does it matter?

• Scanning 10TB at a sustained transfer rate of 75MB/s takes

   ~2 days on 1 node

   ~5 hrs on a 10-node cluster

• Low $/TB for commodity drives

• Low-end servers are multicore capable
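The scan-time figures above can be checked with back-of-the-envelope arithmetic (a sketch using the slide's own numbers; the slide rounds more generously):

```python
# Time to scan 10 TB at a sustained 75 MB/s, on 1 node vs. a 10-node cluster.
TB = 10**12          # decimal terabyte, in bytes
MB = 10**6           # decimal megabyte, in bytes

data = 10 * TB
rate = 75 * MB       # bytes per second per node

one_node_hours = data / rate / 3600
ten_node_hours = one_node_hours / 10   # a full scan parallelizes evenly

print(f"1 node:   ~{one_node_hours / 24:.1f} days")
print(f"10 nodes: ~{ten_node_hours:.1f} hours")
```

This lands at roughly a day and a half on one node and under four hours on ten, the same order of magnitude as the slide's rounded figures.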
Use cases

• ETL - Extract Transform Load

• Pattern Recognition

• Recommendation Engines

• Prediction Models

• Log Processing

• Data “sandbox”
Who uses it?
Who supports it?
What is Hadoop not?

• Not a database replacement

• Not a data warehouse (it complements one)

• Not for interactive reporting

• Not a general purpose storage mechanism

• Not for problems that are not parallelizable in a
  shared-nothing fashion
Architecture – Core Components

HDFS

Distributed filesystem designed for low cost storage
and high bandwidth access across the cluster.


Map-Reduce

Programming model for processing and generating
large data sets.
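The Map-Reduce model can be illustrated with word count, its canonical example (a minimal in-memory sketch of the map/shuffle/reduce phases, not Hadoop's Java API):

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)

def map_reduce(lines):
    # Shuffle: group intermediate pairs by key, as the framework would
    # between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(map_reduce(["the elephant", "the yellow elephant"]))
```

In Hadoop the same mapper and reducer run on many nodes, with the framework handling partitioning, shuffling, and fault tolerance.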
HDFS - Design

•   Files are stored as blocks (64MB default size)

•   Configurable data replication (3x, Rack Aware*)

•   Fault Tolerant, Expects HW failures

•   HUGE files, Expects Streaming not Low Latency

•   Mostly WORM

•   Not POSIX compliant

•   Not mountable OOTB*
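The block size and replication defaults above translate into simple storage arithmetic (a sketch, not an HDFS API; names are illustrative):

```python
import math

BLOCK_SIZE = 64 * 1024**2   # 64 MB default block size
REPLICATION = 3             # default replication factor

def hdfs_footprint(file_bytes):
    """Blocks a file occupies, and raw bytes stored cluster-wide."""
    blocks = math.ceil(file_bytes / BLOCK_SIZE)
    return blocks, file_bytes * REPLICATION

blocks, raw = hdfs_footprint(1 * 1024**3)   # a 1 GB file
print(blocks, raw)   # 16 blocks, 3 GB of raw storage
```

The large block size is why HDFS favors huge streamed files: many small files would bloat the NameNode's in-memory metadata, one entry per block.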
HDFS - Architecture


                                                 Namenode (NN)

  1. Client asks the NN for a file
  2. NN returns the DNs that host it
  3. Client asks the DNs for the data

                                Datanode 1         Datanode 2          Datanode N

Namenode - Master

•     Filesystem metadata
•     Controls read/write access to files
•     Manages block replication

Datanode - Slaves

•     Reads/writes blocks to/from clients
•     Replicates blocks at the master's request
•     Notifies the master about block IDs

                                Single Namespace
                                Single Block Pool
HDFS - Fault tolerance

•   DataNode

         Uses CRC32 checksums to detect corruption
         Data is replicated on other nodes (3x)*

•   NameNode

         fsimage - last snapshot
         edits - changes log since last snapshot
         Checkpoint Node
         Backup NameNode
         Failover is manual*
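The DataNode's checksumming can be sketched with Python's `zlib.crc32` (the same CRC-32 polynomial; the store/verify wrapper here is invented for illustration — HDFS actually checksums fixed-size chunks of each block):

```python
import zlib

def store(block: bytes):
    """Keep a block together with its CRC32, as a DataNode does on write."""
    return block, zlib.crc32(block)

def verify(block: bytes, checksum: int) -> bool:
    """On read, recompute the CRC and compare; a mismatch means corruption."""
    return zlib.crc32(block) == checksum

block, crc = store(b"hadoop block data")
assert verify(block, crc)                      # intact block passes
assert not verify(b"hadoop block dat4", crc)   # a flipped byte is caught
```

When a DataNode detects a corrupt block, the client can fall back to one of the other replicas, and the NameNode schedules re-replication.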
MapReduce - Architecture

Client launches a job via the Jobs API              JobTracker (JT)
  - Configuration
  - Mapper
  - Reducer
  - Input
  - Output
                              TaskTracker 1    TaskTracker 2     TaskTracker N

JobTracker - Master

• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers
• Monitors tasks and TaskTracker status,
  re-executes tasks upon failure
• Speculative execution

TaskTracker - Slaves

• Runs Map and Reduce tasks received from the JobTracker
• Manages storage and transmission of intermediate output
Hadoop - Core Architecture


[Diagram: clients submit jobs via the Jobs API to the JobTracker, which dispatches tasks to TaskTrackers; each TaskTracker is co-located with a DataNode, and the NameNode coordinates HDFS underneath]

* Mini OS: Filesystem & Scheduler
Hadoop 2.0 - HDFS Architecture




• Distributed Namespace
• Multiple Block Pools
Hadoop 2.0 - YARN Architecture
MapReduce - Clients

Java - Native
 hadoop jar jar_path main_class input_path output_path


C++ - Pipes framework
 hadoop pipes -input path_in -output path_out -program exec_program


Any – Streaming
 hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog \
   -input path_in -output path_out


Pig Latin, Hive HQL, C via JNI
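A Streaming mapper or reducer can be any executable that reads lines on stdin and writes tab-separated key/value lines on stdout; word count again, as two Python functions (a sketch — in a real job each would be its own script wired to `sys.stdin`/`sys.stdout`, and the script names passed to `-mapper`/`-reducer` are up to you):

```python
def map_words(lines):
    """Streaming mapper: emit one 'word<TAB>1' line per input word."""
    return [f"{word}\t1" for line in lines for word in line.split()]

def reduce_counts(lines):
    """Streaming reducer: input arrives sorted by key, so each key's
    occurrences form one contiguous run that we sum as we go."""
    out, current, total = [], None, 0
    for line in lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                out.append(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        out.append(f"{current}\t{total}")
    return out

print(map_words(["hadoop taming hadoop"]))
print(reduce_counts(["hadoop\t1", "hadoop\t1", "taming\t1"]))
```

The sorted-runs assumption in the reducer is exactly what the framework's shuffle/sort phase guarantees between the two stages.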
Hadoop - Ecosystem

                    Management

 ZooKeeper      Chukwa          Ambari          HUE

                     Data Access

  Pig        Hive       Flume         Impala    Sqoop

                    Data Processing
 MapReduce     Giraph     Hama        Mahout     MPI

                      Storage
        HDFS                           HBase
Installation - Platforms

Production
    Linux – Official

Development
   Linux
   OSX
   Windows via Cygwin
   *Nix
Installation - Versions

Public Numbering

 1.0.x - current stable version
 1.1.x - current beta version for 1.x branch
 2.x - current alpha version

Development Numbering

 0.20.x aka 1.x - CDH 3 & HDP 1
 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
Installation - For toying

Option 1 - Official project releases
     hadoop.apache.org/common/releases.html

Option 2 - Demo VM from vendors
     •   Cloudera
     •   Hortonworks
     •   Greenplum
     •   MapR

Option 3 - Cloud
     • Amazon’s EMR
     • Hadoop on Azure
Installation - For real

Vendor distributions
   •   Cloudera CDH
   •   Hortonworks HDP
   •   Greenplum GPHD
   •   MapR M3, M5 or M7

Hosted solutions

   •   AWS EMR
   •   Hadoop on Azure

Use Virtualization - VMware Serengeti *
Security - Simple Mode

• Use in a trusted environment
  ‣   Identity comes from euid of the client process
  ‣   MapReduce tasks run as the TaskTracker user
  ‣   User that starts the NameNode is super-user

• Reasonable protection for accidental misuse
• Simple to set up
Security - Secure Mode

• Kerberos based
• Use for tight granular access
    ‣   Identity comes from Kerberos Principal
    ‣   MapReduce tasks run as Kerberos Principal

•   Use a dedicated MIT KDC

•   Hook it to your primary KDC (AD, etc.)

•   Significant setup effort (users, groups and Kerberos keys
    on all nodes, etc.)
Monitoring

Built-in

  • JMX
  • REST
  • No SNMP support
Other

  Cloudera Manager (Free up to 50 nodes)
  Ambari - Free, RPM based systems (RH, CentOS)
Demo
Questions ?
References
Hadoop Operations, by Eric Sammer
Hadoop Security, by Hortonworks Blog

HDFS Federation, by Suresh Srinivas

Hadoop 2.0 New Features, by VertiCloud Inc

MapReduce in Simple Terms, by Saliya Ekanayake

Hadoop Architecture, by Phillipe Julio
