Hadoop, Taming Elephants




  1. Hadoop, Taming Elephants (JaxLUG, 2013) by Ovidiu Dimulescu
  2. About @odimulescu
     • Working on the Web since 1997
     • Into startup and engineering cultures
     • Speaker at user groups and code camps
     • Founder and organizer of JaxMUG.com
     • Organizer of the Jax Big Data meetup
  3. Agenda
     • Background
     • Architecture v1.0 & 2.0
     • Ecosystem
     • Installation
     • Security
     • Monitoring
     • Demo
     • Q & A
  4. What is Hadoop?
     • Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
     • Created by Doug Cutting (creator of Lucene & Nutch)
     • Named after Doug's son's toy elephant
  5. What is it solving, and how?
     • Processes diverse, large datasets in practical time at low cost
     • Consolidates data in a distributed file system
     • Moves computation to the data rather than data to the computation
     • Simpler programming model
  6. Why does it matter?
     • Volume, Velocity, Variety and Value
     • Datasets do not fit on local HDDs, let alone in RAM
     • Scaling up:
       ‣ Is expensive (licensing, hardware, etc.)
       ‣ Has a ceiling (physical, technical, etc.)
  7. Why does it matter? Data types: roughly 80% complex data (images, video, logs, documents, call records, sensor data, mail archives) vs. 20% structured data (user profiles, CRM, HR records). Chart source: IDC White Paper.
  8. Why does it matter?
     • Scanning 10 TB at a sustained transfer rate of 75 MB/s takes ~2 days on 1 node, ~5 hrs on a 10-node cluster
     • Low $/TB for commodity drives
     • Low-end servers are multicore capable
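The arithmetic behind those figures can be checked with a quick back-of-envelope script (a sketch; the exact result depends on whether decimal or binary terabytes are assumed, and the slide rounds up):

```python
# Back-of-envelope check of the scan-time figures on this slide.
TB = 10**12                    # decimal terabyte
dataset = 10 * TB              # 10 TB to scan
throughput = 75 * 10**6        # 75 MB/s sustained per node

one_node_hours = dataset / throughput / 3600
ten_node_hours = one_node_hours / 10   # perfect parallelism assumed

print(f"1 node:   {one_node_hours:.0f} hours (~{one_node_hours / 24:.1f} days)")
print(f"10 nodes: {ten_node_hours:.1f} hours")
```

With these assumptions the raw numbers come out a bit below the slide's rounded "~2 days / ~5 hrs", which also absorb seek overhead and imperfect parallelism.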
  9. Use cases
     • ETL - Extract, Transform, Load
     • Pattern recognition
     • Recommendation engines
     • Prediction models
     • Log processing
     • Data "sandbox"
  10. Who uses it?
  11. Who supports it?
  12. What is Hadoop not?
     • Not a database replacement
     • Not a data warehouse (it complements one)
     • Not for interactive reporting
     • Not a general-purpose storage mechanism
     • Not for problems that are not parallelizable in a shared-nothing fashion
  13. Architecture - Core Components
     HDFS: Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.
     MapReduce: Programming model for processing and generating large data sets.
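The MapReduce programming model can be illustrated without a cluster. A pure-Python word-count sketch (not Hadoop API code) that mimics the map, shuffle/sort, and reduce phases a framework would run:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: sum all counts seen for one key.
    return (word, sum(counts))

def run_job(lines):
    # Shuffle/sort phase: collect and sort intermediate pairs by key,
    # as the framework would, then feed each key group to the reducer.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

result = run_job(["the quick brown fox", "the lazy dog"])
```

The real framework distributes the map and reduce calls across the cluster; the data flow is the same.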
  14. HDFS - Design
     • Files are stored as blocks (64 MB default size)
     • Configurable data replication (3x, rack-aware*)
     • Fault tolerant, expects hardware failures
     • HUGE files; expects streaming reads, not low latency
     • Mostly WORM (write once, read many)
     • Not POSIX compliant
     • Not mountable out of the box*
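A quick sketch of the storage arithmetic these defaults imply (`hdfs_footprint` is a hypothetical helper for illustration, not a Hadoop API): a 1 GB file splits into sixteen 64 MB blocks and, at 3x replication, consumes 3 GB of raw cluster capacity.

```python
import math

def hdfs_footprint(file_bytes, block_size=64 * 1024**2, replication=3):
    # Number of HDFS blocks the file occupies (last block may be partial)
    # and the raw cluster bytes consumed once every block is replicated.
    blocks = math.ceil(file_bytes / block_size)
    return blocks, file_bytes * replication

blocks, raw_bytes = hdfs_footprint(1 * 1024**3)  # a 1 GB file
```

This is also why HDFS prefers huge files: the NameNode keeps metadata per block in memory, so many small files are far more expensive than a few large ones.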
  15. HDFS - Architecture
     The client asks the NameNode (NN) for a file; the NN returns the DataNodes (DNs) that host its blocks; the client then reads the data directly from the DNs.
     NameNode - master:
     • Holds filesystem metadata
     • Controls reads/writes to files
     • Manages block replication
     • Single namespace, single block pool
     DataNode - slaves:
     • Reads/writes blocks to/from clients
     • Replicates blocks at the master's request
     • Notifies the master about its block IDs
  16. HDFS - Fault tolerance
     DataNode:
     • Uses CRC32 checksums to detect corruption
     • Data is replicated on other nodes (3x)*
     NameNode:
     • fsimage - last snapshot
     • edits - change log since the last snapshot
     • Checkpoint Node, Backup NameNode
     • Failover is manual*
  17. MapReduce - Architecture
     A client launches a job against the JobTracker (JT), supplying a configuration, a mapper, a reducer, an input path, and an output path.
     JobTracker - master:
     • Accepts MR jobs submitted by clients
     • Assigns Map and Reduce tasks to TaskTrackers
     • Monitors task and TaskTracker status, re-executes tasks upon failure
     • Speculative execution
     TaskTracker - slaves:
     • Runs Map and Reduce tasks received from the JobTracker
     • Manages storage and transmission of intermediate output
  18. Hadoop - Core Architecture
     TaskTrackers run alongside DataNodes on the worker nodes, with the JobTracker and NameNode as the masters. In effect, a mini OS: filesystem & scheduler.
  19. Hadoop 2.0 - HDFS Architecture
     • Distributed namespace
     • Multiple block pools
  20. Hadoop 2.0 - YARN Architecture
  21. MapReduce - Clients
     Java - native:
       hadoop jar jar_path main_class input_path output_path
     C++ - Pipes framework:
       hadoop pipes -input path_in -output path_out -program exec_program
     Any language - Streaming:
       hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
     Also: Pig Latin, Hive HQL, C via JNI
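A Streaming mapper and reducer are just programs that read lines on stdin and write tab-separated key/value lines on stdout; the framework sorts the mapper's output by key before the reducer sees it. A minimal word-count pair, sketched in Python and driven locally to mimic the map-sort-reduce pipeline:

```python
from io import StringIO

def map_stream(stdin, stdout):
    # Streaming mapper: emit one "word<TAB>1" line per input word.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reduce_stream(stdin, stdout):
    # Streaming reducer: input arrives sorted by key, so keeping a
    # running total for the current key is enough.
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

# Drive the pipeline locally: map, sort (the "shuffle"), reduce.
mapped = StringIO()
map_stream(StringIO("b a b\n"), mapped)
shuffled = StringIO("".join(sorted(mapped.getvalue().splitlines(keepends=True))))
out = StringIO()
reduce_stream(shuffled, out)
```

On a real cluster the same two functions would read `sys.stdin` and write `sys.stdout`, and be passed to `hadoop-streaming.jar` via `-mapper` and `-reducer`.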
  22. Hadoop - Ecosystem
     Management: ZooKeeper, Chukwa, Ambari, HUE
     Data access: Pig, Hive, Flume, Impala, Sqoop
     Data processing: MapReduce, Giraph, Hama, Mahout, MPI
     Storage: HDFS, HBase
  23. Installation - Platforms
     Production: Linux (official)
     Development: Linux, OS X, Windows via Cygwin, *nix
  24. Installation - Versions
     Public numbering:
     • 1.0.x - current stable version
     • 1.1.x - current beta version for the 1.x branch
     • 2.x - current alpha version
     Development numbering:
     • 0.20.x aka 1.x - CDH 3 & HDP 1
     • 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
  25. Installation - For toying
     Option 1 - Official project releases: hadoop.apache.org/common/releases.html
     Option 2 - Demo VM from a vendor: Cloudera, Hortonworks, Greenplum, MapR
     Option 3 - Cloud: Amazon's EMR, Hadoop on Azure
  26. Installation - For real
     Vendor distributions:
     • Cloudera CDH
     • Hortonworks HDP
     • Greenplum GPHD
     • MapR M3, M5 or M7
     Hosted solutions:
     • AWS EMR
     • Hadoop on Azure
     Use virtualization - VMware Serengeti*
  27. Security - Simple Mode
     • Use in a trusted environment
       ‣ Identity comes from the euid of the client process
       ‣ MapReduce tasks run as the TaskTracker user
       ‣ The user that starts the NameNode is the super-user
     • Reasonable protection against accidental misuse
     • Simple to set up
  28. Security - Secure Mode
     • Kerberos based
     • Use for tight, granular access control
       ‣ Identity comes from a Kerberos principal
       ‣ MapReduce tasks run as the Kerberos principal
     • Use a dedicated MIT KDC
     • Hook it to your primary KDC (AD, etc.)
     • Significant setup effort (users, groups and Kerberos keys on all nodes, etc.)
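The mode is selected in core-site.xml via the `hadoop.security.authentication` property (`simple`, the default, or `kerberos`). A minimal fragment for switching to secure mode might look like the following; the per-daemon keytab and principal settings that a real secure cluster also needs are omitted here:

```xml
<!-- core-site.xml: switch from the default "simple" mode to Kerberos -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value> <!-- default is "simple" -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value> <!-- enable service-level authorization checks -->
</property>
```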
  29. Monitoring
     Built-in:
     • JMX
     • REST
     • No SNMP support
     Other:
     • Cloudera Manager (free up to 50 nodes)
     • Ambari - free, for RPM-based systems (RHEL, CentOS)
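The NameNode's embedded web server exposes its JMX metrics as JSON at the /jmx path (on the web UI port, 50070 in Hadoop 1.x). A sketch of consuming that payload; the bean and its values below are an illustrative sample, not output captured from a real cluster:

```python
import json

# Shape of the JSON returned by the NameNode's /jmx endpoint:
# a top-level "beans" array of JMX MBeans, each with a "name" plus
# its attributes. The values here are made up for the example.
sample = json.loads("""
{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystemState",
   "CapacityTotal": 1099511627776,
   "NumLiveDataNodes": 3}
]}
""")

def find_bean(payload, name_fragment):
    # Return the first bean whose JMX name contains the fragment.
    for bean in payload["beans"]:
        if name_fragment in bean["name"]:
            return bean
    return None

state = find_bean(sample, "FSNamesystemState")
```

Against a live cluster the same parsing would be applied to the body of an HTTP GET on `http://<namenode>:50070/jmx`.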
  30. Demo
  31. Questions?
  32. References
     • Hadoop Operations, by Eric Sammer
     • Hadoop Security, Hortonworks blog
     • HDFS Federation, by Suresh Srinivas
     • Hadoop 2.0 New Features, by VertiCloud Inc
     • MapReduce in Simple Terms, by Saliya Ekanayake
     • Hadoop Architecture, by Philippe Julio