Hadoop, Taming Elephants
Transcript

  • 1. Hadoop, Taming Elephants. JaxLUG, 2013. Ovidiu Dimulescu
  • 2. About @odimulescu
    • Working on the Web since 1997
    • Into startup and engineering cultures
    • Speaker at user groups and code camps
    • Founder and organizer of JaxMUG.com
    • Organizer of the Jax Big Data meetup
  • 3. Agenda
    • Background
    • Architecture v1.0 & 2.0
    • Ecosystem
    • Installation
    • Security
    • Monitoring
    • Demo
    • Q & A
  • 4. What is Hadoop?
    • Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware
    • Created by Doug Cutting (creator of Lucene & Nutch)
    • Named after Doug’s son’s toy elephant
  • 5. What is it solving, and how?
    • Processing diverse, large datasets in practical time at low cost
    • Consolidates data in a distributed file system
    • Moves computation to data rather than data to computation
    • Simpler programming model
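The simpler programming model the slide refers to can be sketched in plain Python, with no Hadoop involved. The mapper/reducer/shuffle names mirror the framework's phases, but this is only an illustrative toy, not Hadoop's API:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (key, value) pair for each word
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Reduce phase: aggregate all values seen for one key
    return (word, sum(counts))

def run_job(lines):
    # Shuffle/sort: group intermediate pairs by key, as the framework would
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(reducer(word, (v for _, v in group))
                for word, group in groupby(pairs, key=itemgetter(0)))

result = run_job(["hadoop tames elephants", "elephants remember hadoop"])
print(result)  # {'elephants': 2, 'hadoop': 2, 'remember': 1, 'tames': 1}
```

In the real framework the map and reduce calls run on many nodes and the shuffle moves data over the network; the user still only writes the two small functions.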
  • 6. Why does it matter?
    • Volume, Velocity, Variety and Value
    • Datasets do not fit on local HDDs, let alone RAM
    • Scaling up
      ‣ Is expensive (licensing, hardware, etc.)
      ‣ Has a ceiling (physical, technical, etc.)
  • 7. Why does it matter? Data types
    • Complex data (80%): images, video, logs, documents, call records, sensor data, mail archives
    • Structured data (20%): user profiles, CRM, HR records
    • Chart source: IDC White Paper
  • 8. Why does it matter?
    • Scanning 10 TB at a sustained transfer rate of 75 MB/s takes ~2 days on 1 node, ~5 hrs on a 10-node cluster
    • Low $/TB for commodity drives
    • Low-end servers are multicore capable
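The back-of-the-envelope numbers above are easy to check (decimal TB and MB assumed here; the ideal 10-way split comes out a bit under the slide's ~5 hrs, which presumably allows for coordination overhead):

```python
TB, MB = 10**12, 10**6

data_bytes = 10 * TB
rate = 75 * MB                     # sustained scan rate in bytes/s per node

one_node_s = data_bytes / rate     # ~133,000 seconds of sequential reading
print(one_node_s / 86400)          # ~1.5 days on a single node
print(one_node_s / 10 / 3600)      # ~3.7 hours split evenly over 10 nodes
```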
  • 9. Use cases
    • ETL - Extract, Transform, Load
    • Pattern recognition
    • Recommendation engines
    • Prediction models
    • Log processing
    • Data “sandbox”
  • 10. Who uses it?
  • 11. Who supports it?
  • 12. What is Hadoop not?
    • Not a database replacement
    • Not a data warehouse (it complements one)
    • Not for interactive reporting
    • Not a general-purpose storage mechanism
    • Not for problems that are not parallelizable in a shared-nothing fashion
  • 13. Architecture - Core Components
    • HDFS: distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster
    • MapReduce: programming model for processing and generating large data sets
  • 14. HDFS - Design
    • Files are stored as blocks (64 MB default size)
    • Configurable data replication (3x, Rack Aware*)
    • Fault tolerant, expects HW failures
    • HUGE files, expects streaming, not low latency
    • Mostly WORM
    • Not POSIX compliant
    • Not mountable OOTB*
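A quick sketch of what those defaults mean for storage accounting, using the 64 MB block size and 3x replication from the slide (the function name is my own, and HDFS does not pad the final partial block):

```python
import math

BLOCK_SIZE = 64 * 1024**2   # 64 MB default block size
REPLICATION = 3             # default replication factor

def hdfs_footprint(file_bytes):
    """Return (number of blocks, raw bytes stored cluster-wide) for one file."""
    blocks = math.ceil(file_bytes / BLOCK_SIZE)
    return blocks, file_bytes * REPLICATION

# A 1 GB file splits into 16 blocks and occupies 3 GB of raw cluster storage
print(hdfs_footprint(1024**3))
```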
  • 15. HDFS - Architecture
    • Read path: the client asks the NameNode (NN) for a file; the NN returns the DataNodes (DNs) that host its blocks; the client reads the data from the DNs directly
    • NameNode - Master
      ‣ Filesystem metadata
      ‣ Controls reads/writes to files
      ‣ Manages block replication
      ‣ Single namespace, single block pool
    • DataNode - Slaves
      ‣ Reads / writes blocks to / from clients
      ‣ Replicates blocks at the master’s request
      ‣ Notifies the master about block-ids
  • 16. HDFS - Fault tolerance
    • DataNode
      ‣ Uses CRC32 to avoid corruption
      ‣ Data is replicated on other nodes (3x)*
    • NameNode
      ‣ fsimage - last snapshot
      ‣ edits - change log since the last snapshot
      ‣ Checkpoint Node
      ‣ Backup NameNode
      ‣ Failover is manual*
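The CRC32 check a DataNode relies on can be demonstrated with Python's zlib; this is a toy illustration of the idea, not HDFS's actual checksum code:

```python
import zlib

block = b"hadoop block payload " * 1000    # stand-in for a stored block
checksum = zlib.crc32(block)               # recorded when the block is written

# Flip one byte: the stored CRC no longer matches, so the bad replica is
# detected on read and can be re-fetched from another node
corrupted = b"X" + block[1:]
mismatch = zlib.crc32(corrupted) != checksum
print(mismatch)  # True
```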
  • 17. MapReduce - Architecture
    • Client launches a job (configuration, mapper, reducer, input, output) against the JobTracker (JT) API
    • JobTracker - Master
      ‣ Accepts MR jobs submitted by clients
      ‣ Assigns Map and Reduce tasks to TaskTrackers
      ‣ Monitors task and TaskTracker status, re-executes tasks upon failure
      ‣ Speculative execution
    • TaskTracker - Slaves
      ‣ Runs Map and Reduce tasks received from the JobTracker
      ‣ Manages storage and transmission of intermediate output
  • 18. Hadoop - Core Architecture
    • Diagram: the JobTracker schedules TaskTrackers 1..N, each colocated with DataNodes 1..N, on top of HDFS and its NameNode
    • * Mini OS: Filesystem & Scheduler
  • 19. Hadoop 2.0 - HDFS Architecture
    • Distributed namespace
    • Multiple block pools
  • 20. Hadoop 2.0 - YARN Architecture
  • 21. MapReduce - Clients
    • Java - native: hadoop jar jar_path main_class input_path output_path
    • C++ - Pipes framework: hadoop pipes -input path_in -output path_out -program exec_program
    • Any language - Streaming: hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out
    • Pig Latin, Hive HQL, C via JNI
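The Streaming interface above only requires two programs that speak tab-separated lines on stdin/stdout. A minimal word-count pair sketched in Python, with a local dry run standing in for `mapper < input | sort | reducer` (function names and sample input are illustrative):

```python
import io
from itertools import groupby

def map_stream(lines, out):
    # Mapper: emit "word<TAB>1" per word, Streaming's line protocol
    for line in lines:
        for word in line.split():
            out.write(f"{word}\t1\n")

def reduce_stream(lines, out):
    # Reducer: input arrives sorted by key; sum the counts per word
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        out.write(f"{word}\t{sum(int(v) for _, v in group)}\n")

# Local dry run of: mapper < input | sort | reducer
mapped = io.StringIO()
map_stream(["to be or not to be"], mapped)
reduced = io.StringIO()
reduce_stream(sorted(mapped.getvalue().splitlines(keepends=True)), reduced)
print(reduced.getvalue())  # be 2 / not 1 / or 1 / to 2
```

On a cluster, Hadoop Streaming performs the sort-and-group step between the two programs, so the same pair of scripts scales out unchanged.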
  • 22. Hadoop - Ecosystem
    • Management: ZooKeeper, Chukwa, Ambari, HUE
    • Data access: Pig, Hive, Flume, Impala, Sqoop
    • Data processing: MapReduce, Giraph, Hama, Mahout, MPI
    • Storage: HDFS, HBase
  • 23. Installation - Platforms
    • Production: Linux (official)
    • Development: Linux, OS X, Windows via Cygwin, *nix
  • 24. Installation - Versions
    • Public numbering
      ‣ 1.0.x - current stable version
      ‣ 1.1.x - current beta version for the 1.x branch
      ‣ 2.x - current alpha version
    • Development numbering
      ‣ 0.20.x aka 1.x - CDH 3 & HDP 1
      ‣ 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
  • 25. Installation - For toying
    • Option 1 - Official project releases: hadoop.apache.org/common/releases.html
    • Option 2 - Demo VM from vendors
      ‣ Cloudera
      ‣ Hortonworks
      ‣ Greenplum
      ‣ MapR
    • Option 3 - Cloud
      ‣ Amazon’s EMR
      ‣ Hadoop on Azure
  • 26. Installation - For real
    • Vendor distributions
      ‣ Cloudera CDH
      ‣ Hortonworks HDP
      ‣ Greenplum GPHD
      ‣ MapR M3, M5 or M7
    • Hosted solutions
      ‣ AWS EMR
      ‣ Hadoop on Azure
    • Use virtualization - VMware Serengeti *
  • 27. Security - Simple Mode
    • Use in a trusted environment
      ‣ Identity comes from the euid of the client process
      ‣ MapReduce tasks run as the TaskTracker user
      ‣ The user that starts the NameNode is the super-user
    • Reasonable protection against accidental misuse
    • Simple to set up
  • 28. Security - Secure Mode
    • Kerberos based
    • Use for tight, granular access
      ‣ Identity comes from a Kerberos principal
      ‣ MapReduce tasks run as the Kerberos principal
    • Use a dedicated MIT KDC
    • Hook it to your primary KDC (AD, etc.)
    • Significant setup effort (users, groups and Kerberos keys on all nodes, etc.)
  • 29. Monitoring
    • Built-in
      ‣ JMX
      ‣ REST
      ‣ No SNMP support
    • Other
      ‣ Cloudera Manager (free up to 50 nodes)
      ‣ Ambari - free, RPM-based systems (RH, CentOS)
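The JMX and REST hooks meet in the NameNode's /jmx servlet, which exposes MBean metrics as JSON over HTTP. The sketch below parses an abbreviated, made-up sample of such a response rather than hitting a live cluster; the bean name follows Hadoop's naming scheme, but the metric values here are invented:

```python
import json

# Abbreviated, hypothetical sample of what GET /jmx on the NameNode returns
payload = """{"beans": [
  {"name": "Hadoop:service=NameNode,name=FSNamesystemState",
   "NumLiveDataNodes": 12, "NumDeadDataNodes": 1}
]}"""

def bean(doc, name):
    # Select a single MBean out of the /jmx "beans" array by its name
    return next(b for b in json.loads(doc)["beans"] if b["name"] == name)

state = bean(payload, "Hadoop:service=NameNode,name=FSNamesystemState")
print(state["NumLiveDataNodes"], state["NumDeadDataNodes"])  # 12 1
```

A cron job or Nagios check can poll this endpoint and alert when dead-node counts rise, which is essentially what the vendor tools automate.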
  • 30. Demo
  • 31. Questions ?
  • 32. References
    • Hadoop Operations, by Eric Sammer
    • Hadoop Security, by the Hortonworks blog
    • HDFS Federation, by Suresh Srinivas
    • Hadoop 2.0 New Features, by VertiCloud Inc
    • MapReduce in Simple Terms, by Saliya Ekanayake
    • Hadoop Architecture, by Phillipe Julio
