Hadoop, Taming Elephants
    Presentation Transcript

    • Hadoop, Taming Elephants
      JaxLUG, 2013
      Ovidiu Dimulescu
    • About
      - @odimulescu
      - Working on the Web since 1997
      - Into startup and engineering cultures
      - Speaker at user groups and code camps
      - Founder and organizer of JaxMUG.com
      - Organizer of the Jax Big Data meetup
    • Agenda
      - Background
      - Architecture v1.0 & 2.0
      - Ecosystem
      - Installation
      - Security
      - Monitoring
      - Demo
      - Q & A
    • What is Hadoop?
      - Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
      - Created by Doug Cutting (creator of Lucene & Nutch)
      - Named after Doug’s son’s toy elephant
    • What does it solve, and how?
      - Processes diverse, large datasets in practical time at low cost
      - Consolidates data in a distributed file system
      - Moves computation to the data rather than data to the computation
      - Offers a simpler programming model
    • Why does it matter?
      - Volume, Velocity, Variety and Value
      - Datasets do not fit on local HDDs, let alone in RAM
      - Scaling up:
        ‣ Is expensive (licensing, hardware, etc.)
        ‣ Has a ceiling (physical, technical, etc.)
    • Why does it matter? Data types
      - ~80% complex data: images, video, logs, documents, call records, sensor data, mail archives
      - ~20% structured data: user profiles, CRM, HR records
      (Chart source: IDC White Paper)
    • Why does it matter?
      - Scanning 10 TB at a sustained transfer rate of 75 MB/s takes ~2 days on 1 node, ~5 hrs on a 10-node cluster
      - Low $/TB for commodity drives
      - Low-end servers are multi-core capable
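The scan-time claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes) and ideal linear scaling; the slide's figures are rounded and presumably include real-world overhead:

```python
# Back-of-envelope check of the 10 TB scan-time figures (decimal units assumed).
def scan_hours(total_bytes: float, mb_per_sec: float, nodes: int = 1) -> float:
    """Hours to scan total_bytes at mb_per_sec per node, assuming linear scaling."""
    seconds = total_bytes / (mb_per_sec * 1e6 * nodes)
    return seconds / 3600

TEN_TB = 10e12  # 10 TB in bytes

print(f"1 node:   {scan_hours(TEN_TB, 75):.1f} h")      # ~37 h, i.e. ~1.5 days
print(f"10 nodes: {scan_hours(TEN_TB, 75, 10):.1f} h")  # ~3.7 h ideal; ~5 h with overhead
```

The ideal numbers come out slightly lower than the slide's; the gap is plausibly scheduling and I/O overhead on a real cluster.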
    • Use cases
      - ETL: Extract, Transform, Load
      - Pattern recognition
      - Recommendation engines
      - Prediction models
      - Log processing
      - Data “sandbox”
    • Who uses it?
    • Who supports it?
    • What is Hadoop not?
      - Not a database replacement
      - Not a data warehouse (it complements one)
      - Not for interactive reporting
      - Not a general-purpose storage mechanism
      - Not for problems that are not parallelizable in a shared-nothing fashion
    • Architecture – Core Components
      - HDFS: distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster
      - MapReduce: programming model for processing and generating large data sets
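The MapReduce programming model can be illustrated without Hadoop at all. The sketch below is a toy in-memory simulation of the three phases (map, shuffle/sort, reduce) using word count, the canonical example; it is not Hadoop API code:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to each record; collect all emitted (key, value) pairs."""
    return [kv for rec in records for kv in mapper(rec)]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its grouped values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Word count: the mapper emits (word, 1); the reducer sums the ones.
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return sum(counts)

lines = ["taming elephants", "elephants remember"]
counts = reduce_phase(shuffle(map_phase(lines, wc_mapper)), wc_reducer)
print(counts)  # {'taming': 1, 'elephants': 2, 'remember': 1}
```

The programmer supplies only the two pure functions; everything else (distribution, grouping, fault tolerance) is the framework's job, which is what makes the model "simpler".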
    • HDFS - Design
      - Files are stored as blocks (64 MB default size)
      - Configurable data replication (3x, Rack Aware*)
      - Fault-tolerant; expects HW failures
      - HUGE files; expects streaming, not low latency
      - Mostly WORM (write once, read many)
      - Not POSIX compliant
      - Not mountable OOTB*
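The block size and replication factor make a file's storage footprint easy to estimate. A hedged sketch (real HDFS also stores per-block checksum metadata, and the last block of a file occupies only its actual length):

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB default block size
REPLICATION = 3                # default replication factor

def hdfs_footprint(file_bytes: int,
                   block_size: int = BLOCK_SIZE,
                   replication: int = REPLICATION):
    """Return (block_count, raw_bytes_consumed) for a file of file_bytes."""
    blocks = math.ceil(file_bytes / block_size)
    return blocks, file_bytes * replication

blocks, raw = hdfs_footprint(1 * 1024**3)  # a 1 GiB file
print(blocks, raw)  # 16 blocks, 3 GiB of raw cluster storage
```

This is also why HDFS wants HUGE files: a 1 KB file still costs one block's worth of NameNode metadata, so many small files waste namespace capacity.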
    • HDFS - Architecture
      - NameNode (NN) - master:
        ‣ Holds filesystem metadata
        ‣ Controls reads/writes to files
        ‣ Manages block replication
        ‣ Single namespace, single block pool
      - DataNodes (DN) - slaves:
        ‣ Read/write blocks to/from clients
        ‣ Replicate blocks at the master’s request
        ‣ Notify the master about block IDs
      - Read path: the client asks the NN for a file, the NN returns the DataNodes that host it, and the client asks a DataNode for the data
    • HDFS - Fault tolerance
      - DataNode:
        ‣ Uses CRC32 checksums to avoid corruption
        ‣ Data is replicated on other nodes (3x)*
      - NameNode:
        ‣ fsimage - last snapshot
        ‣ edits - change log since the last snapshot
        ‣ Checkpoint Node
        ‣ Backup NameNode
        ‣ Failover is manual*
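The DataNode checksum idea can be illustrated with Python's `zlib.crc32`. This is an illustration of the mechanism only, not Hadoop's implementation: HDFS checksums fixed-size chunks within each block rather than whole blocks:

```python
import zlib

def checksum(data: bytes) -> int:
    """CRC32 of a chunk of data, masked to an unsigned 32-bit value."""
    return zlib.crc32(data) & 0xFFFFFFFF

block = b"block contents as stored on a DataNode"
stored_crc = checksum(block)  # computed and stored at write time

# On read, recompute and compare; a mismatch means this replica is corrupt
assert checksum(block) == stored_crc

corrupted = block[:-1] + b"X"  # simulate a single-byte bit-rot error
print(checksum(corrupted) == stored_crc)  # False -> read from another replica
```

When a corrupt replica is detected, the client falls back to another replica and the bad copy is eventually re-replicated, which is how checksums and 3x replication work together.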
    • MapReduce - Architecture
      - The client launches a job: configuration, mapper, reducer, input, output
      - JobTracker (JT) - master:
        ‣ Accepts MR jobs submitted by clients
        ‣ Assigns Map and Reduce tasks to TaskTrackers
        ‣ Monitors tasks and TaskTracker status; re-executes tasks upon failure
        ‣ Speculative execution
      - TaskTrackers - slaves:
        ‣ Run Map and Reduce tasks received from the JobTracker
        ‣ Manage storage and transmission of intermediate output
    • Hadoop - Core Architecture
      - MapReduce layer: JobTracker (master) with TaskTrackers 1..N (slaves)
      - HDFS layer: NameNode* (master) with DataNodes 1..N (slaves)
      - TaskTrackers are co-located with DataNodes; together they act as a mini OS: filesystem & scheduler
    • Hadoop 2.0 - HDFS Architecture
      - Distributed namespace
      - Multiple block pools
    • Hadoop 2.0 - YARN Architecture
    • MapReduce - Clients
      - Java (native):
        hadoop jar jar_path main_class input_path output_path
      - C++ (Pipes framework):
        hadoop pipes -input path_in -output path_out -program exec_program
      - Any language (Streaming):
        hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
      - Pig Latin, Hive HQL, C via JNI
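For the Streaming client, the mapper and reducer are any executables that read lines on stdin and write tab-separated key/value lines on stdout; the framework sorts the mapper output by key before it reaches the reducer. A minimal word-count pair in Python might look like this (a sketch; the script name and the map/reduce argument convention are illustrative, not a Hadoop requirement):

```python
import sys

def run_mapper(lines, out=sys.stdout):
    """Map side of word count: emit 'word<TAB>1' for every word."""
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

def run_reducer(lines, out=sys.stdout):
    """Sum counts per word; input arrives sorted by key from the shuffle."""
    current, total = None, 0
    for line in lines:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                out.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        out.write(f"{current}\t{total}\n")

if __name__ == "__main__" and len(sys.argv) > 1:
    # e.g. invoked as: wordcount.py map   or   wordcount.py reduce
    (run_mapper if sys.argv[1] == "map" else run_reducer)(sys.stdin)
```

It would then be submitted roughly as the slide shows, e.g. `-mapper "wordcount.py map" -reducer "wordcount.py reduce"`, with the scripts shipped to the cluster via the streaming jar's `-file` option.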
    • Hadoop - Ecosystem
      - Management: ZooKeeper, Chukwa, Ambari, HUE
      - Data access: Pig, Hive, Flume, Impala, Sqoop
      - Data processing: MapReduce, Giraph, Hama, Mahout, MPI
      - Storage: HDFS, HBase
    • Installation - Platforms
      - Production: Linux (official)
      - Development: Linux, OS X, Windows (via Cygwin), *nix
    • Installation - Versions
      - Public numbering:
        ‣ 1.0.x - current stable version
        ‣ 1.1.x - current beta version for the 1.x branch
        ‣ 2.x - current alpha version
      - Development numbering:
        ‣ 0.20.x aka 1.x - CDH 3 & HDP 1
        ‣ 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
    • Installation - For toying
      - Option 1: official project releases - hadoop.apache.org/common/releases.html
      - Option 2: demo VMs from vendors - Cloudera, Hortonworks, Greenplum, MapR
      - Option 3: cloud - Amazon’s EMR, Hadoop on Azure
    • Installation - For real
      - Vendor distributions: Cloudera CDH, Hortonworks HDP, Greenplum GPHD, MapR M3, M5 or M7
      - Hosted solutions: AWS EMR, Hadoop on Azure
      - Use virtualization: VMware Serengeti*
    • Security - Simple Mode
      - Use in a trusted environment:
        ‣ Identity comes from the euid of the client process
        ‣ MapReduce tasks run as the TaskTracker user
        ‣ The user that starts the NameNode is the super-user
      - Reasonable protection against accidental misuse
      - Simple to set up
    • Security - Secure Mode
      - Kerberos-based; use for tight, granular access:
        ‣ Identity comes from a Kerberos principal
        ‣ MapReduce tasks run as the Kerberos principal
      - Use a dedicated MIT KDC
      - Hook it to your primary KDC (AD, etc.)
      - Significant setup effort (users, groups and Kerberos keys on all nodes, etc.)
    • Monitoring
      - Built-in: JMX, REST; no SNMP support
      - Other: Cloudera Manager (free up to 50 nodes), Ambari (free, RPM-based systems: RH, CentOS)
    • Demo
    • Questions ?
    • References
      - Hadoop Operations, by Eric Sammer
      - Hadoop Security, by the Hortonworks blog
      - HDFS Federation, by Suresh Srinivas
      - Hadoop 2.0 New Features, by VertiCloud Inc
      - MapReduce in Simple Terms, by Saliya Ekanayake
      - Hadoop Architecture, by Phillipe Julio