Hadoop, Taming Elephants
Presentation Transcript

• Hadoop, Taming Elephants
  JaxLUG, 2013
  Ovidiu Dimulescu
• About @odimulescu
  ‣ Working on the Web since 1997
  ‣ Into startup and engineering cultures
  ‣ Speaker at user groups and code camps
  ‣ Founder and organizer of JaxMUG.com
  ‣ Organizer of the Jax Big Data meetup
• Agenda
  ‣ Background
  ‣ Architecture v1.0 & 2.0
  ‣ Ecosystem
  ‣ Installation
  ‣ Security
  ‣ Monitoring
  ‣ Demo
  ‣ Q & A
• What is Hadoop?
  ‣ Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
  ‣ Created by Doug Cutting (Lucene & Nutch creator)
  ‣ Named after Doug's son's toy elephant
• What is it solving, and how?
  ‣ Processing diverse, large datasets in practical time at low cost
  ‣ Consolidates data in a distributed file system
  ‣ Moves computation to data rather than data to computation
  ‣ Simpler programming model
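The "simpler programming model" referred to above is MapReduce: a map function emits key-value pairs and a reduce function aggregates them per key. A minimal, cluster-free sketch of that model in plain Python (function names `map_phase` and `reduce_phase` are illustrative, not Hadoop API names):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: sort by key, group, and sum the counts per word
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["Hadoop tames elephants", "elephants remember Hadoop"]
print(dict(reduce_phase(map_phase(lines))))
# {'elephants': 2, 'hadoop': 2, 'remember': 1, 'tames': 1}
```

On a real cluster the framework runs the map phase in parallel near the data blocks and handles the sort/shuffle between phases; the user writes only the two functions.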
• Why does it matter?
  ‣ Volume, Velocity, Variety and Value
  ‣ Datasets do not fit on local HDDs, let alone RAM
  ‣ Scaling up:
    - Is expensive (licensing, hardware, etc.)
    - Has a ceiling (physical, technical, etc.)
• Why does it matter? Data types
  [Chart: ~80% of data is complex (images, video, logs, documents, call records, sensor data, mail archives); ~20% is structured (user profiles, CRM, HR records). Chart source: IDC White Paper]
• Why does it matter?
  ‣ Scanning 10 TB at a sustained transfer rate of 75 MB/s takes ~2 days on 1 node, ~5 hrs on a 10-node cluster
  ‣ Low $/TB for commodity drives
  ‣ Low-end servers are multi-core capable
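The scan-time figures above can be sanity-checked with back-of-the-envelope arithmetic (single sequential reader per node, ignoring replication and coordination overhead):

```python
tb = 10 * 1024**4      # 10 TB in bytes
rate = 75 * 1024**2    # 75 MB/s in bytes per second

single_node_hours = tb / rate / 3600
print(f"1 node:   {single_node_hours:.1f} h (~{single_node_hours / 24:.1f} days)")

ten_node_hours = single_node_hours / 10
print(f"10 nodes: {ten_node_hours:.1f} h")
```

The raw math gives roughly 39 hours on one node and roughly 4 hours on ten; real clusters land nearer the slide's ~2 days and ~5 hrs once scheduling and I/O contention are included.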
• Use cases
  ‣ ETL - Extract, Transform, Load
  ‣ Pattern recognition
  ‣ Recommendation engines
  ‣ Prediction models
  ‣ Log processing
  ‣ Data "sandbox"
    • Who uses it?
    • Who supports it?
• What is Hadoop not?
  ‣ Not a database replacement
  ‣ Not a data warehouse (it complements one)
  ‣ Not for interactive reporting
  ‣ Not a general-purpose storage mechanism
  ‣ Not for problems that are not parallelizable in a shared-nothing fashion
• Architecture - Core Components
  ‣ HDFS: distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster
  ‣ MapReduce: programming model for processing and generating large data sets
• HDFS - Design
  ‣ Files are stored as blocks (64 MB default size)
  ‣ Configurable data replication (3x, Rack Aware*)
  ‣ Fault tolerant, expects HW failures
  ‣ HUGE files, expects streaming access rather than low latency
  ‣ Mostly WORM (write once, read many)
  ‣ Not POSIX compliant
  ‣ Not mountable OOTB*
• HDFS - Architecture
  ‣ Read path: the client asks the NameNode (NN) for a file, the NN returns the DataNodes (DNs) that host its blocks, and the client then reads the data directly from those DNs
  ‣ NameNode - Master: holds filesystem metadata, controls reads/writes to files, manages block replication; single namespace, single block pool
  ‣ DataNode - Slaves: read/write blocks to/from clients, replicate blocks at the master's request, notify the master about block IDs
• HDFS - Fault tolerance
  ‣ DataNode:
    - Uses CRC32 to detect corruption
    - Data is replicated on other nodes (3x)*
  ‣ NameNode:
    - fsimage - last snapshot
    - edits - change log since last snapshot
    - Checkpoint Node
    - Backup NameNode
    - Failover is manual*
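The DataNode's CRC32 corruption check can be illustrated with Python's `zlib`. This is a sketch of the idea only, not HDFS's on-disk checksum format; HDFS stores one checksum per fixed-size chunk of each block (512 bytes by default), which is what lets it pinpoint the damaged region and re-fetch just that block from a replica:

```python
import zlib

CHUNK = 512  # bytes per checksum, mirroring the HDFS default

def checksums(block: bytes):
    # One CRC32 per fixed-size chunk of the block
    return [zlib.crc32(block[i:i + CHUNK]) for i in range(0, len(block), CHUNK)]

block = b"x" * 2048
stored = checksums(block)

# Simulate a single byte flipped by disk corruption
corrupted = block[:700] + b"y" + block[701:]

bad = [i for i, (a, b) in enumerate(zip(stored, checksums(corrupted))) if a != b]
print("corrupt chunks:", bad)
# corrupt chunks: [1]  (byte 700 falls in the second 512-byte chunk)
```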
• MapReduce - Architecture
  ‣ A client launches a job (configuration, mapper, reducer, input, output) against the JobTracker (JT) API
  ‣ JobTracker - Master:
    - Accepts MR jobs submitted by clients
    - Assigns Map and Reduce tasks to TaskTrackers
    - Monitors tasks and TaskTracker status, re-executes tasks upon failure
    - Speculative execution
  ‣ TaskTracker - Slaves:
    - Run the Map and Reduce tasks received from the JobTracker
    - Manage storage and transmission of intermediate output
• Hadoop - Core Architecture
  [Diagram: the JobTracker coordinates TaskTrackers 1..N, each co-located with a DataNode; the NameNode manages HDFS metadata underneath. * Mini OS: Filesystem & Scheduler]
• Hadoop 2.0 - HDFS Architecture
  ‣ Distributed namespace
  ‣ Multiple block pools
    • Hadoop 2.0 - YARN Architecture
• MapReduce - Clients
  ‣ Java - native: hadoop jar jar_path main_class input_path output_path
  ‣ C++ - Pipes framework: hadoop pipes -input path_in -output path_out -program exec_program
  ‣ Any - Streaming: hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
  ‣ Pig Latin, Hive HQL, C via JNI
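The Streaming client runs any executable that reads lines on stdin and writes tab-separated key-value records on stdout. A minimal word-count mapper/reducer pair in Python could look like the sketch below (a single illustrative script; the file name and the map/reduce argument convention are assumptions, not part of the Streaming contract):

```python
import sys
from collections import defaultdict

def map_lines(lines):
    # Mapper: emit one "word<TAB>1" record per word
    return [f"{word}\t1" for line in lines for word in line.split()]

def reduce_lines(lines):
    # Reducer: Streaming delivers records sorted by key, so summing per word is safe
    counts = defaultdict(int)
    for line in lines:
        word, n = line.rstrip("\n").rsplit("\t", 1)
        counts[word] += int(n)
    return [f"{word}\t{total}" for word, total in counts.items()]

if __name__ == "__main__":
    phase = map_lines if sys.argv[1:2] == ["map"] else reduce_lines
    print("\n".join(phase(sys.stdin)))
```

Saved as, say, `wc.py`, it would be submitted with the Streaming invocation shown above: `-mapper 'python wc.py map' -reducer 'python wc.py reduce'`.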
• Hadoop - Ecosystem
  ‣ Management: ZooKeeper, Chukwa, Ambari, HUE
  ‣ Data Access: Pig, Hive, Flume, Impala, Sqoop
  ‣ Data Processing: MapReduce, Giraph, Hama, Mahout, MPI
  ‣ Storage: HDFS, HBase
• Installation - Platforms
  ‣ Production: Linux (official)
  ‣ Development: Linux, OS X, Windows via Cygwin, *nix
• Installation - Versions
  ‣ Public numbering:
    - 1.0.x - current stable version
    - 1.1.x - current beta version for the 1.x branch
    - 2.x - current alpha version
  ‣ Development numbering:
    - 0.20.x aka 1.x - CDH 3 & HDP 1
    - 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
• Installation - For toying
  ‣ Option 1 - Official project releases: hadoop.apache.org/common/releases.html
  ‣ Option 2 - Demo VM from vendors: Cloudera, Hortonworks, Greenplum, MapR
  ‣ Option 3 - Cloud: Amazon's EMR, Hadoop on Azure
• Installation - For real
  ‣ Vendor distributions: Cloudera CDH, Hortonworks HDP, Greenplum GPHD, MapR M3/M5/M7
  ‣ Hosted solutions: AWS EMR, Hadoop on Azure
  ‣ Use virtualization - VMware Serengeti *
• Security - Simple Mode
  ‣ Use in a trusted environment:
    - Identity comes from the euid of the client process
    - MapReduce tasks run as the TaskTracker user
    - The user that starts the NameNode is the super-user
  ‣ Reasonable protection against accidental misuse
  ‣ Simple to set up
• Security - Secure Mode
  ‣ Kerberos based; use for tight, granular access:
    - Identity comes from a Kerberos principal
    - MapReduce tasks run as the Kerberos principal
  ‣ Use a dedicated MIT KDC
  ‣ Hook it to your primary KDC (AD, etc.)
  ‣ Significant setup effort (users, groups and Kerberos keys on all nodes, etc.)
• Monitoring
  ‣ Built-in: JMX, REST; no SNMP support
  ‣ Other: Cloudera Manager (free up to 50 nodes); Ambari - free, for RPM-based systems (RHEL, CentOS)
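The built-in JMX metrics are also exposed over REST: each Hadoop daemon serves its MBeans as JSON from a `/jmx` servlet on its web port (the NameNode web UI listens on port 50070 in Hadoop 1.x). A hedged sketch of pulling and filtering those metrics; the hostname and bean-name fragment below are illustrative:

```python
import json
from urllib.request import urlopen

def fetch_beans(url):
    # Every Hadoop daemon publishes its JMX MBeans as JSON under /jmx
    with urlopen(url) as resp:
        return json.load(resp)["beans"]

def find_bean(beans, name_fragment):
    # Pick out one MBean by a fragment of its JMX object name
    return next((b for b in beans if name_fragment in b.get("name", "")), None)

# Against a running cluster (hostname illustrative):
#   beans = fetch_beans("http://namenode:50070/jmx")
#   fs = find_bean(beans, "FSNamesystem")
#   print(fs["CapacityRemaining"])
```

Cloudera Manager and Ambari build their dashboards on the same JMX data; polling `/jmx` directly is the lightest-weight option for custom checks.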
    • Demo
    • Questions ?
• References
  ‣ Hadoop Operations, by Eric Sammer
  ‣ Hadoop Security, by the Hortonworks Blog
  ‣ HDFS Federation, by Suresh Srinivas
  ‣ Hadoop 2.0 New Features, by VertiCloud Inc
  ‣ MapReduce in Simple Terms, by Saliya Ekanayake
  ‣ Hadoop Architecture, by Phillipe Julio