What is it?
• Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
• Created by Doug Cutting (creator of Lucene & Nutch)
• Named after Doug's son's toy elephant
What is it solving, and how?
• Processing diverse, large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to the data rather than data to the computation
• Simpler programming model
Why does it matter?
• Volume, Velocity, Variety and Value
• Datasets do not fit on local HDDs, let alone in RAM
• Scaling up
  ‣ Is expensive (licensing, hardware, etc.)
  ‣ Has a ceiling (physical, technical, etc.)
Why does it matter? Data types
• Complex data (~80%): images, video, logs, documents, call records, sensor data, mail archives
• Structured data (~20%): user profiles, CRM, HR records
(Chart source: IDC White Paper)
Why does it matter?
• Scanning 10TB at a sustained transfer rate of 75MB/s takes ~2 days on 1 node, ~5 hrs on a 10-node cluster
• Low $/TB for commodity drives
• Low-end servers are multicore capable
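The scan-time figures above are back-of-the-envelope numbers; a quick sketch of the arithmetic (assuming decimal units, 1 TB = 10^12 bytes, and reads spread evenly across nodes with no overhead):

```python
# Rough check of the 10TB scan-time claim, ignoring coordination overhead.
DATASET_BYTES = 10 * 10**12   # 10 TB
THROUGHPUT = 75 * 10**6       # 75 MB/s sustained per node

def scan_hours(nodes: int) -> float:
    """Hours to scan the dataset when reads are split evenly over `nodes`."""
    seconds = DATASET_BYTES / (THROUGHPUT * nodes)
    return seconds / 3600

print(f"1 node:   {scan_hours(1):.1f} h")   # ~37 h, roughly the "~2 days" on the slide
print(f"10 nodes: {scan_hours(10):.1f} h")  # ~3.7 h, in line with "~5 hrs" once overhead is added
```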
Use cases
• ETL - Extract, Transform, Load
• Pattern Recognition
• Recommendation Engines
• Prediction Models
• Log Processing
• Data "sandbox"
Who uses it?
Who supports it?
What is Hadoop not?
• Not a database replacement
• Not a data warehouse (it complements one)
• Not for interactive reporting
• Not a general-purpose storage mechanism
• Not for problems that cannot be parallelized in a shared-nothing fashion
Architecture – Core Components
• HDFS: Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
• MapReduce: Programming model for processing and generating large data sets.
HDFS - Design
• Files are stored as blocks (64MB default size)
• Configurable data replication (3x default, Rack Aware*)
• Fault tolerant; expects HW failures
• HUGE files; expects streaming access, not low latency
• Mostly WORM (write once, read many)
• Not POSIX compliant
• Not mountable OOTB*
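The block size and replication factor above are cluster settings in hdfs-site.xml; a minimal fragment might look like the following (property names as in Hadoop 1.x-era releases; later versions rename `dfs.block.size` to `dfs.blocksize`):

```xml
<configuration>
  <!-- Default block size: 64MB, expressed in bytes -->
  <property>
    <name>dfs.block.size</name>
    <value>67108864</value>
  </property>
  <!-- Replicate each block to 3 DataNodes -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```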
HDFS - Architecture
Read path:
1. Client asks the NameNode (NN) for a file
2. NN returns the DataNodes (DNs) that host it
3. Client asks the DNs directly for the data

NameNode - Master
• Filesystem metadata
• Controls read/write access to files
• Manages block replication
• Single Namespace, Single Block Pool

DataNode - Slaves
• Reads/writes blocks to/from clients
• Replicates blocks at the master's request
• Notifies the master about block-ids
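The read path above can be sketched as a toy model; the point is that the NameNode serves only metadata, while block data flows directly between clients and DataNodes (class and method names here are invented for illustration, not Hadoop's API):

```python
# Toy model of the HDFS read path: NameNode holds metadata only;
# block contents live on (replicated) DataNodes.
class NameNode:
    def __init__(self):
        self.block_map = {}  # filename -> [(block_id, [replica DataNodes]), ...]

    def add_file(self, name, blocks):
        self.block_map[name] = blocks

    def get_block_locations(self, name):
        return self.block_map[name]

class DataNode:
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

# One file split into two blocks, each replicated on two DataNodes.
dn1, dn2, dn3 = DataNode(), DataNode(), DataNode()
dn1.blocks[0] = b"hello "; dn2.blocks[0] = b"hello "
dn2.blocks[1] = b"world";  dn3.blocks[1] = b"world"

nn = NameNode()
nn.add_file("/logs/a.txt", [(0, [dn1, dn2]), (1, [dn2, dn3])])

# Client: ask the NN for block locations, then fetch each block
# from the first available replica.
data = b"".join(replicas[0].read_block(bid)
                for bid, replicas in nn.get_block_locations("/logs/a.txt"))
print(data.decode())  # hello world
```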
HDFS - Fault tolerance
• DataNode
  ‣ Uses CRC32 checksums to detect corruption
  ‣ Data is replicated on other nodes (3x)*
• NameNode
  ‣ fsimage - last snapshot of the filesystem metadata
  ‣ edits - change log since the last snapshot
  ‣ Checkpoint Node / Backup NameNode
  ‣ Failover is manual*
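The checksum idea can be illustrated with Python's `zlib.crc32` (a sketch of the principle only; HDFS actually checksums fixed-size chunks of each block):

```python
import zlib

block = b"some HDFS block contents"
stored_checksum = zlib.crc32(block)  # computed when the block is written

# Later, on read: recompute and compare against the stored value.
intact = zlib.crc32(block) == stored_checksum
print(intact)  # True: the block passes verification

corrupted = b"some HDFS block c0ntents"  # one flipped byte
print(zlib.crc32(corrupted) == stored_checksum)  # False: fall back to another replica
```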
MapReduce - Architecture
Client launches a job, submitting to the JobTracker (JT):
• Configuration
• Mapper
• Reducer
• Input
• Output

JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers
• Monitors task and TaskTracker status; re-executes tasks upon failure
• Speculative execution

TaskTracker - Slaves
• Run Map and Reduce tasks received from the JobTracker
• Manage storage and transmission of intermediate output
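The programming model itself (as opposed to the distributed runtime) can be sketched as a single-process word count: map emits (key, value) pairs, the framework groups values by key (the shuffle), and reduce folds each group. Function names here are illustrative, not Hadoop's Java API:

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    """Reduce phase: sum all counts emitted for one word."""
    return word, sum(counts)

def run_job(lines):
    # Shuffle phase: group all mapper outputs by key.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

print(run_job(["the cat sat", "the cat"]))  # {'cat': 2, 'sat': 1, 'the': 2}
```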
Hadoop - Core Architecture
• MapReduce layer: JobTracker (master) over TaskTracker 1..N
• HDFS layer: NameNode* (master) over DataNode 1..N
• TaskTrackers and DataNodes run co-located on the same worker machines
• A mini OS: filesystem & scheduler
Security - Simple Mode
• Use in a trusted environment
  ‣ Identity comes from the euid of the client process
  ‣ MapReduce tasks run as the TaskTracker user
  ‣ The user that starts the NameNode is the super-user
• Reasonable protection against accidental misuse
• Simple to set up
Security - Secure Mode
• Kerberos based
• Use for tight, granular access control
  ‣ Identity comes from the Kerberos principal
  ‣ MapReduce tasks run as the Kerberos principal
• Use a dedicated MIT KDC
• Hook it into your primary KDC (AD, etc.)
• Significant setup effort (users, groups and Kerberos keys on all nodes, etc.)
Monitoring
Built-in
• JMX
• REST
• No SNMP support

Other
• Cloudera Manager (free up to 50 nodes)
• Ambari - free, for RPM-based systems (Red Hat, CentOS)
References
• Hadoop Operations, by Eric Sammer
• Hadoop Security, Hortonworks Blog
• HDFS Federation, by Suresh Srinivas
• Hadoop 2.0 New Features, by VertiCloud Inc
• MapReduce in Simple Terms, by Saliya Ekanayake
• Hadoop Architecture, by Phillipe Julio