Hadoop, Taming elephants
      JaxLUG, 2013

      Ovidiu Dimulescu
About @odimulescu
• Working on the Web since 1997
• Into startup and engineering cultures
• Speaker at user groups, code camps
• Founder and organizer for JaxMUG.com
• Organizer for Jax Big Data meetup
Agenda
  •   Background
  •   Architecture v1.0 & 2.0
  •   Ecosystem
  •   Installation
  •   Security
  •   Monitoring
  •   Demo
  •   Q & A
What is Hadoop?
• Apache Hadoop is an open source Java software
  framework for running data-intensive applications on
  large clusters of commodity hardware

• Created by Doug Cutting (Lucene & Nutch creator)

• Named after Doug’s son’s toy elephant
What is it solving, and how?
• Processing diverse large datasets in practical time at low cost
• Consolidates data in a distributed file system
• Moves computation to data rather than data to computation
• Simpler programming model



[Diagram: work spread across many CPUs]
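The idea of moving computation to data can be illustrated with a toy placement rule. This is only a sketch under stated assumptions: the function name `pick_node` and the node names are hypothetical, and Hadoop's real scheduler is considerably more involved.

```python
def pick_node(block_replicas, idle_nodes):
    """Choose where to run a task that reads one block.

    block_replicas: set of node names that store a copy of the block.
    idle_nodes: list of node names that can accept a task right now.
    Returns (node, "local" or "remote").
    """
    # Prefer a node that already holds the data, so nothing crosses the network.
    for node in idle_nodes:
        if node in block_replicas:
            return node, "local"
    # Otherwise the block has to be shipped to whichever node is free.
    return idle_nodes[0], "remote"


print(pick_node({"node2", "node3"}, ["node1", "node2"]))
print(pick_node({"node3"}, ["node1", "node2"]))
```

The first call finds an idle node that already stores the block; the second has to fall back to a remote read.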
Why does it matter?
• Volume, Velocity, Variety and Value

• Datasets do not fit on local HDDs let alone RAM

• Scaling up

   ‣ Is expensive (licensing, hardware, etc.)
   ‣ Has a ceiling (physical, technical, etc.)
Why does it matter?

Data types *

• Complex data (80%)
   ‣ Images, video
   ‣ Logs
   ‣ Documents
   ‣ Call records
   ‣ Sensor data
   ‣ Mail archives

• Structured data (20%)
   ‣ User profiles
   ‣ CRM
   ‣ HR records

* Chart source: IDC White Paper
Why does it matter?

• Scanning 10 TB at a sustained transfer rate of 75 MB/s takes

   ~2 days on 1 node

   ~5 hrs on a 10-node cluster

• Low $/TB for commodity drives

• Low-end servers are multicore capable
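The arithmetic behind the scan estimate can be sketched as follows. The assumptions are mine, not the slide's: decimal units, data spread evenly across nodes, and no coordination overhead. Under those idealized conditions the result is roughly 37 hours on one node and under 4 hours on ten, which is the same range as the rounded figures on the slide.

```python
TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)


def scan_hours(dataset_bytes, bytes_per_second_per_node, nodes=1):
    """Idealized time, in hours, to read the whole dataset once."""
    seconds = dataset_bytes / (bytes_per_second_per_node * nodes)
    return seconds / 3600


single = scan_hours(10 * TB, 75 * MB, nodes=1)
cluster = scan_hours(10 * TB, 75 * MB, nodes=10)

print(f"1 node:   {single:.1f} h ({single / 24:.1f} days)")
print(f"10 nodes: {cluster:.1f} h")
```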
Use cases

• ETL - Extract Transform Load

• Pattern Recognition

• Recommendation Engines

• Prediction Models

• Log Processing

• Data “sandbox”
Who uses it?
Who supports it?
What is Hadoop not?

• Not a database replacement

• Not a data warehouse (complements it)

• Not for interactive reporting

• Not a general purpose storage mechanism

• Not for problems that are not parallelizable in a
  shared-nothing fashion
Architecture – Core Components

HDFS

Distributed filesystem designed for low cost storage
and high bandwidth access across the cluster.


Map-Reduce

Programming model for processing and generating
large data sets.
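To make the programming model concrete, here is a minimal in-memory sketch of the map, shuffle and reduce steps using word count. It runs in a single process and does not use Hadoop; the function names are illustrative only.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group all values by key, as happens between the map and reduce steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Call the reducer once per key with all of that key's values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    return sum(values)


lines = ["Hadoop stores data", "Hadoop processes data"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

The user supplies only the mapper and the reducer; grouping by key is the framework's job, which is what keeps the model simple.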
HDFS - Design

•   Files are stored as blocks (64MB default size)

•   Configurable data replication (3x, Rack Aware*)

•   Fault Tolerant, Expects HW failures

•   HUGE files, Expects Streaming not Low Latency

•   Mostly WORM

•   Not POSIX compliant

•   Not mountable OOTB*
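The block size and replication factor above translate directly into block counts and raw disk usage. The sketch below assumes the defaults quoted on the slide (64 MB blocks, 3 replicas) and assumes the final, partially filled block occupies only as much disk as the data it holds; `hdfs_footprint` is a hypothetical helper, not a Hadoop API.

```python
def hdfs_footprint(file_bytes, block_bytes=64 * 1024**2, replication=3):
    """Return (blocks, block_replicas, raw_bytes) for a single file."""
    blocks = -(-file_bytes // block_bytes)  # ceiling division without floats
    block_replicas = blocks * replication
    raw_bytes = file_bytes * replication    # each byte is stored `replication` times
    return blocks, block_replicas, raw_bytes


one_gib = 1024**3
blocks, replicas, raw = hdfs_footprint(one_gib)
print(f"{blocks} blocks, {replicas} block replicas, {raw / 1024**3:.0f} GiB raw")
```

A 1 GiB file therefore becomes 16 blocks and 48 stored replicas, which is one reason very many small files are a poor fit: each one still costs the master at least one block entry.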
HDFS - Architecture


[Diagram: one Namenode (NN) above Datanode 1, Datanode 2, ... Datanode N (DN)]

Read path
  1. Client asks NN for file
  2. NN returns DNs that host it
  3. Client asks DN for data

Namenode - Master

•     Filesystem metadata
•     Controls read/write to files
•     Manages block replication

Datanode - Slaves

•     Serves block reads and writes for clients
•     Replicates blocks at master’s request
•     Notifies master about block-ids

Single Namespace
Single Block Pool
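The read path can be sketched with two toy classes. This is an in-memory model under simplifying assumptions (no network, no failures, always the first replica); the class and method names are illustrative and are not the HDFS client API.

```python
class DataNode:
    """Stores block contents; knows nothing about files."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Holds only metadata: which blocks make up a file and where they live."""

    def __init__(self):
        self.files = {}      # path -> ordered list of block ids
        self.locations = {}  # block id -> list of DataNode objects

    def get_block_locations(self, path):
        return [(block_id, self.locations[block_id]) for block_id in self.files[path]]


def read_file(namenode, path):
    """Ask the NameNode where the blocks are, then fetch the data from DataNodes."""
    data = b""
    for block_id, datanodes in namenode.get_block_locations(path):
        data += datanodes[0].read_block(block_id)  # simplification: first replica
    return data


dn1, dn2 = DataNode("dn1"), DataNode("dn2")
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"world"

nn = NameNode()
nn.files["/logs/a.txt"] = ["blk_1", "blk_2"]
nn.locations = {"blk_1": [dn1], "blk_2": [dn2]}

print(read_file(nn, "/logs/a.txt"))
```

Note that file contents never pass through the master; it answers the lookup and the client talks to the DataNodes directly.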
HDFS - Fault tolerance

•   DataNode

         Uses CRC32 checksums to detect corruption
         Data is replicated on other nodes (3x)*

•   NameNode

         fsimage - last snapshot
         edits - log of changes since the last snapshot
         Checkpoint Node
         Backup NameNode
         Failover is manual*
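Two of the mechanisms above can be sketched in a few lines: checksum verification on the DataNode side, and rebuilding NameNode state from a snapshot plus a change log. Both are simplified illustrations. The checksum here covers a whole block for brevity, and the `(operation, path, value)` edit format is invented for this example rather than taken from Hadoop.

```python
import zlib


def checksum(block):
    """CRC32 of the block contents, stored alongside the data."""
    return zlib.crc32(block)


def is_intact(block, expected):
    """A mismatch means the copy is corrupt and another replica should be used."""
    return zlib.crc32(block) == expected


def recover_namespace(fsimage, edits):
    """Replay the edit log on top of the last snapshot."""
    namespace = dict(fsimage)
    for operation, path, value in edits:
        if operation == "create":
            namespace[path] = value
        elif operation == "delete":
            namespace.pop(path, None)
    return namespace


original = b"some block contents"
stored_crc = checksum(original)
damaged = b"some block c0ntents"
print(is_intact(original, stored_crc), is_intact(damaged, stored_crc))

fsimage = {"/a": ["blk_1"], "/b": ["blk_2"]}
edits = [("create", "/c", ["blk_3"]), ("delete", "/a", None)]
print(recover_namespace(fsimage, edits))
```

The longer the edit log grows, the longer this replay takes at startup, which is the motivation for periodically merging it into a new snapshot.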
MapReduce - Architecture

[Diagram: one JobTracker (JT) above TaskTracker 1, TaskTracker 2, ... TaskTracker N, reached through the Jobs API]

Client launches a job
  - Configuration
  - Mapper
  - Reducer
  - Input
  - Output

JobTracker - Master

• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution

TaskTracker - Slaves

• Run Map and Reduce tasks received from JobTracker
• Manage storage and transmission of intermediate output
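Re-executing a failed task on another worker can be sketched as a simple retry loop. This is a toy model: trackers are plain Python callables, failures are signalled with `RuntimeError`, and the `max_attempts` parameter is a hypothetical knob rather than a documented Hadoop setting.

```python
def run_job(tasks, trackers, max_attempts=3):
    """Run every task, moving to the next tracker whenever an attempt fails."""
    results = {}
    for task_id, task in tasks.items():
        for attempt in range(max_attempts):
            tracker = trackers[attempt % len(trackers)]
            try:
                results[task_id] = tracker(task)
                break  # task succeeded, stop retrying
            except RuntimeError:
                continue  # tracker failed, try the next one
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results


def healthy_tracker(task):
    return task()


def broken_tracker(task):
    raise RuntimeError("tracker lost")


tasks = {"map-0": lambda: 1 + 1, "map-1": lambda: 2 * 3}
results = run_job(tasks, [broken_tracker, healthy_tracker])
print(results)
```

Because map and reduce tasks are expected to be free of side effects, running one a second time on a different node is safe, and the same property is what makes speculative execution possible.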
Hadoop - Core Architecture


[Diagram: the JobTracker (Jobs API) and the NameNode (HDFS) are the masters; each worker node 1..N runs a TaskTracker alongside a DataNode]

* Mini OS: Filesystem & Scheduler
Hadoop 2.0 - HDFS Architecture




• Distributed Namespace
• Multiple Block Pools
Hadoop 2.0 - YARN Architecture
MapReduce - Clients

Java - Native
 hadoop jar jar_path main_class input_path output_path


C++ - Pipes framework
 hadoop pipes -input path_in -output path_out -program exec_program


Any – Streaming
 hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out


Pig Latin, Hive HQL, C via JNI
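With Streaming, the mapper and reducer are ordinary programs that read lines on standard input and write tab-separated key/value lines on standard output, and the reducer sees its input sorted by key. A minimal word-count pair might look like the sketch below. The functions work on any iterable of lines so they can be tried locally; wrapping them in scripts that read `sys.stdin` is left out, and the function names are illustrative.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input is sorted by key, so equal words arrive next to each other."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local dry run: sorted() stands in for the framework's sort between the phases.
mapped = sorted(map_lines(["a b a", "b c"]))
reduced = list(reduce_lines(mapped))
print(reduced)
```

The same pipeline can be rehearsed in a shell before submitting a job by piping a sample file through the mapper, `sort`, and the reducer.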
Hadoop - Ecosystem

                    Management

 ZooKeeper      Chukwa          Ambari          HUE

                     Data Access

  Pig        Hive       Flume         Impala    Sqoop

                    Data Processing
 MapReduce     Giraph     Hama        Mahout     MPI

                      Storage
        HDFS                           HBase
Installation - Platforms

Production
    Linux – Official

Development
   Linux
   OSX
   Windows via Cygwin
   *Nix
Installation - Versions

Public Numbering

 1.0.x - current stable version
 1.1.x - current beta version for 1.x branch
 2.x - current alpha version

Development Numbering

 0.20.x aka 1.x - CDH 3 & HDP 1
 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
Installation - For toying

Option 1 - Official project releases
     hadoop.apache.org/common/releases.html

Option 2 - Demo VM from vendors
     •   Cloudera
     •   Hortonworks
     •   Greenplum
     •   MapR

Option 3 - Cloud
     • Amazon’s EMR
     • Hadoop on Azure
Installation - For real

Vendor distributions
   •   Cloudera CDH
   •   Hortonworks HDP
   •   Greenplum GPHD
   •   MapR M3, M5 or M7

Hosted solutions

   •   AWS EMR
   •   Hadoop on Azure

Use Virtualization - VMware Serengeti *
Security - Simple Mode

• Use in a trusted environment
  ‣   Identity comes from euid of the client process
  ‣   MapReduce tasks run as the TaskTracker user
  ‣   User that starts the NameNode is super-user

• Reasonable protection against accidental misuse
• Simple to set up
Security - Secure Mode

• Kerberos based
• Use for tight, granular access control
    ‣   Identity comes from Kerberos Principal
    ‣   MapReduce tasks run as Kerberos Principal

•   Use a dedicated MIT KDC

•   Hook it to your primary KDC (AD, etc.)

•   Significant setup effort (users, groups and Kerberos keys
    on all nodes, etc.)
Monitoring

Built-in

  • JMX
  • REST
  • No SNMP support
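As a rough illustration of consuming built-in metrics from a script, the sketch below parses a JSON document shaped as a top-level `beans` list. Everything specific here is an assumption made for the example: the payload is a made-up sample rather than captured output, and the bean name and attribute names are hypothetical.

```python
import json

# Made-up sample payload; a real daemon's bean and attribute names will differ.
SAMPLE = """
{"beans": [
  {"name": "example:service=NameNode", "CapacityUsed": 750, "CapacityTotal": 1000},
  {"name": "example:service=Other", "Uptime": 12345}
]}
"""


def find_bean(payload, name):
    """Return the first bean with the given name, or None if absent."""
    for bean in json.loads(payload)["beans"]:
        if bean.get("name") == name:
            return bean
    return None


def used_fraction(bean):
    """Fraction of capacity in use, from the two hypothetical attributes."""
    return bean["CapacityUsed"] / bean["CapacityTotal"]


bean = find_bean(SAMPLE, "example:service=NameNode")
print(f"capacity used: {used_fraction(bean):.0%}")
```

A check like this is easy to hook into an existing alerting system, which helps offset the lack of SNMP support.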
Other

  Cloudera Manager (Free up to 50 nodes)
  Ambari - Free, RPM based systems (RH, CentOS)
Demo
Questions ?
References
Hadoop Operations, by Eric Sammer
Hadoop Security, by Hortonworks Blog

HDFS Federation, by Suresh Srinivas

Hadoop 2.0 New Features, by VertiCloud Inc

MapReduce in Simple Terms, by Saliya Ekanayake

Hadoop Architecture, by Phillipe Julio

Hadoop, Taming Elephants

  • 1. Hadoop, Taming elephants JaxLUG, 2013 Ovidiu Dimulescu
  • 2. About @odimulescu • Working on the Web since 1997 • Into startup and engineering cultures • Speaker at user groups, code camps • Founder and organizer for JaxMUG.com • Organizer for Jax Big Data meetup
  • 3. Agenda • Background • Architecture v1.0 & 2.0 • Ecosystem • Installation • Security • Monitoring • Demo • Q &A
  • 4. What is ? • Apache Hadoop is an open source Java software framework for running data-intensive applications on large clusters of commodity hardware • Created by Doug Cutting (Lucene & Nutch creator) • Named after Doug’s son’s toy elephant
  • 5. What and how is solving? • Processing diverse large datasets in practical time at low cost • Consolidates data in a distributed file system • Moves computation to data rather then data to computation • Simpler programming model CPU CPU CPU CPU CPU CPU CPU CPU
  • 6. Why does it matter? • Volume, Velocity, Variety and Value • Datasets do not fit on local HDDs let alone RAM • Scaling up ‣ Is expensive (licensing, hardware, etc.) ‣ Has a ceiling (physical, technical, etc.)
  • 7. Why does it matter? Data types Complex Data Images,Video 20% Logs Documents Call records Sensor data 80% Mail archives Structured Data Complex Structured User Profiles CRM * Chart Source: IDC White Paper HR Records
  • 8. Why does it matter? • Scanning 10TB at sustained transfer of 75MB/s takes ~2 days on 1 node ~5 hrs on 10 nodes cluster • Low $/TB for commodity drives • Low-end servers are multicore capable
  • 9. Use cases • ETL - Extract Transform Load • Pattern Recognition • Recommendation Engines • Prediction Models • Log Processing • Data “sandbox”
  • 12. What is Hadoop not? • Not a database replacement • Not a data warehousing (complements it) • Not for interactive reporting • Not a general purpose storage mechanism • Not for problems that are not parallelizable in a share-nothing fashion
  • 13. Architecture – Core Components • HDFS: Distributed filesystem designed for low cost storage and high bandwidth access across the cluster • Map-Reduce: Programming model for processing and generating large data sets
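To make the Map-Reduce programming model concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop at all; it only imitates the three steps (map, shuffle, reduce) on a tiny invented log sample. The function names and the log format are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit (status_code, 1) for one log line like 'GET /path 200'."""
    status = record.split()[-1]
    yield status, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: collapse all values of one key into a single result."""
    return key, sum(values)

def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())

# Invented sample input, for illustration only.
logs = ["GET /index.html 200", "GET /missing 404", "POST /login 200"]
print(run_job(logs))  # {'200': 2, '404': 1}
```

On a real cluster the map and reduce calls would run on different nodes, and the shuffle would move data over the network; the shape of the user code stays the same.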
  • 14. HDFS - Design • Files are stored as blocks (64MB default size) • Configurable data replication (3x, Rack Aware*) • Fault Tolerant, Expects HW failures • HUGE files, Expects Streaming not Low Latency • Mostly WORM • Not POSIX compliant • Not mountable OOTB*
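The block size and replication factor from slide 14 translate directly into storage numbers. The sketch below assumes the 64MB default and 3x replication from the slide, and that the last block only occupies as much space as the data it holds; `block_layout` is a hypothetical helper, not a Hadoop API.

```python
import math

BLOCK_SIZE = 64 * 1024**2  # 64MB default block size (from the slide)
REPLICATION = 3            # default replication factor (from the slide)

def block_layout(file_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (blocks, block replicas, raw bytes stored) for one file."""
    blocks = math.ceil(file_bytes / block_size)
    replicas = blocks * replication
    raw_bytes = file_bytes * replication  # a partial last block is not padded
    return blocks, replicas, raw_bytes

blocks, replicas, raw_bytes = block_layout(1024**3)  # a 1GB file
print(blocks, replicas, raw_bytes // 1024**3)        # 16 48 3
```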
  • 15. HDFS - Architecture • Read path: 1) Client asks the Namenode (NN) for a file 2) NN returns the Datanodes (DNs) that host it 3) Client asks a DN for the data • Namenode - Master: filesystem metadata; controls read/write to files; manages block replication • Datanodes (1..N) - Slaves: read/write blocks to/from clients; replicate blocks at master’s request; notify master about block-ids • Single Namespace, Single Block Pool
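The read path on slide 15 (client asks the Namenode where a file lives, then fetches the data from a Datanode) can be imitated with a toy in-memory model. The class and method names below are invented for this sketch and are not the real HDFS client API; the point is only that the Namenode serves metadata while the bytes come from Datanodes.

```python
class NameNode:
    """Holds metadata only: which blocks make up a file and where replicas live."""
    def __init__(self):
        self.files = {}      # path -> ordered list of block ids
        self.locations = {}  # block id -> names of DataNodes holding a replica

    def get_block_locations(self, path):
        return [(block_id, self.locations[block_id]) for block_id in self.files[path]]

class DataNode:
    """Holds the actual block contents."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}     # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(namenode, datanodes, path):
    """Client side: ask the NameNode for locations, then read from DataNodes."""
    data = b""
    for block_id, hosts in namenode.get_block_locations(path):
        data += datanodes[hosts[0]].read_block(block_id)  # first replica, for simplicity
    return data

nn = NameNode()
dns = {name: DataNode(name) for name in ("dn1", "dn2")}
dns["dn1"].blocks["blk_1"] = b"hello "
dns["dn2"].blocks["blk_1"] = b"hello "
dns["dn2"].blocks["blk_2"] = b"world"
nn.files["/demo.txt"] = ["blk_1", "blk_2"]
nn.locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2"]}

print(read_file(nn, dns, "/demo.txt"))  # b'hello world'
```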
  • 16. HDFS - Fault tolerance • DataNode ‣ Uses CRC32 to detect corruption ‣ Data is replicated on other nodes (3x)* • NameNode ‣ fsimage - last snapshot ‣ edits - changes log since last snapshot ‣ Checkpoint Node ‣ Backup NameNode ‣ Failover is manual*
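The DataNode side of slide 16 combines two ideas: a CRC32 checksum to notice a damaged copy, and replication so that another copy can be read instead. A minimal sketch of that combination, using Python's standard `zlib.crc32` and one checksum per block for simplicity (an assumption; the helper names are hypothetical):

```python
import zlib

def store(data):
    """Keep the data together with the CRC32 computed at write time."""
    return {"data": data, "crc": zlib.crc32(data)}

def verify(replica):
    """A replica is healthy if its data still matches the stored checksum."""
    return zlib.crc32(replica["data"]) == replica["crc"]

def read_with_fallback(replicas):
    """Return the first replica that passes verification."""
    for replica in replicas:
        if verify(replica):
            return replica["data"]
    raise IOError("all replicas are corrupt")

replicas = [store(b"block-0") for _ in range(3)]  # 3x replication
replicas[0]["data"] = b"blocK-0"                  # simulate corruption on one node

print(verify(replicas[0]))            # False
print(read_with_fallback(replicas))   # b'block-0'
```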
  • 17. MapReduce - Architecture • Client launches a job: Configuration, Mapper, Reducer, Input, Output • Diagram labels: JOBS, JobTracker (JT), API, TaskTracker 1 ... TaskTracker N • JobTracker - Master: accepts MR jobs submitted by clients; assigns Map and Reduce tasks to TaskTrackers; monitors tasks and TaskTracker status, re-executes tasks upon failure; speculative execution • TaskTracker - Slaves: run Map and Reduce tasks received from JobTracker; manage storage and transmission of intermediate output
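The "re-executes tasks upon failure" behaviour of the JobTracker on slide 17 can be sketched as a retry loop that moves a failed task to another tracker. This is a simplified illustration, not the real scheduler: trackers are plain Python callables, assignment is round-robin, and the attempt limit of 4 is just an illustrative value.

```python
def run_tasks(tasks, trackers, max_attempts=4):
    """Assign each task to a tracker; on failure, retry it on the next tracker."""
    results = {}
    for i, task in enumerate(tasks):
        for attempt in range(max_attempts):
            tracker = trackers[(i + attempt) % len(trackers)]
            try:
                results[task] = tracker(task)
                break  # task succeeded, move on to the next one
            except RuntimeError:
                continue  # tracker failed, re-execute elsewhere
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return results

def healthy_tracker(task):
    return task.upper()

def broken_tracker(task):
    raise RuntimeError("tracker lost")

print(run_tasks(["a", "b", "c"], [broken_tracker, healthy_tracker]))
# {'a': 'A', 'b': 'B', 'c': 'C'}
```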
  • 18. Hadoop - Core Architecture • Diagram labels: JOBS layer (JobTracker; TaskTracker 1, 2 ... N), API, HDFS layer (DataNode 1, 2 ... N; NameNode) * Mini OS: Filesystem & Scheduler
  • 19. Hadoop 2.0 - HDFS Architecture • Distributed Namespace • Multiple Block Pools
  • 20. Hadoop 2.0 - YARN Architecture
  • 21. MapReduce - Clients • Java - Native: hadoop jar jar_path main_class input_path output_path • C++ - Pipes framework: hadoop pipes -input path_in -output path_out -program exec_program • Any - Streaming: hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out • Pig Latin, Hive HQL, C via JNI
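With Streaming, `map_prog` and `reduce_prog` can be any executables that read lines on stdin and write tab-separated key/value lines on stdout; the framework sorts the mapper output by key before the reducer sees it. Below is a minimal word-count sketch of such a pair in one Python file. The file name `wc.py` and the `map` / `reduce` arguments are assumptions made for this example.

```python
import sys
from itertools import groupby

def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word on every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_lines(sorted_lines):
    """Reducer: input arrives sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Only acts as a script when called as: python wc.py map   or   python wc.py reduce
if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The same pipeline can be tried locally without a cluster, with `sort` standing in for the shuffle: `cat input.txt | python wc.py map | sort | python wc.py reduce`.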
  • 22. Hadoop - Ecosystem • Management: ZooKeeper, Chukwa, Ambari, HUE • Data Access: Pig, Hive, Flume, Impala, Sqoop • Data Processing: MapReduce, Giraph, Hama, Mahout, MPI • Storage: HDFS, HBase
  • 23. Installation - Platforms • Production: Linux (official) • Development: Linux, OSX, Windows via Cygwin, *Nix
  • 24. Installation - Versions • Public Numbering ‣ 1.0.x - current stable version ‣ 1.1.x - current beta version for 1.x branch ‣ 2.x - current alpha version • Development Numbering ‣ 0.20.x aka 1.x - CDH 3 & HDP 1 ‣ 0.23.x aka 2.x - CDH 4 & HDP 2 (alpha)
  • 25. Installation - For toying • Option 1 - Official project releases: hadoop.apache.org/common/releases.html • Option 2 - Demo VM from vendors: Cloudera, Hortonworks, Greenplum, MapR • Option 3 - Cloud: Amazon’s EMR, Hadoop on Azure
  • 26. Installation - For real Vendor distributions • Cloudera CDH • Hortonworks HDP • Greenplum GPHD • MapR M3, M5 or M7 Hosted solutions • AWS EMR • Hadoop on Azure Use Virtualization - VMware Serengeti *
  • 27. Security - Simple Mode • Use in a trusted environment ‣ Identity comes from euid of the client process ‣ MapReduce tasks run as the TaskTracker user ‣ User that starts the NameNode is super-user • Reasonable protection against accidental misuse • Simple to set up
  • 28. Security - Secure Mode • Kerberos based • Use for tight granular access ‣ Identity comes from Kerberos Principal ‣ MapReduce tasks run as Kerberos Principal • Use a dedicated MIT KDC • Hook it to your primary KDC (AD, etc.) • Significant setup effort (users, groups and Kerberos keys on all nodes, etc.)
  • 29. Monitoring • Built-in: JMX, REST, no SNMP support • Other: Cloudera Manager (free up to 50 nodes); Ambari (free, RPM-based systems: RH, CentOS)
  • 30. Demo
  • 32. References • Hadoop Operations, by Eric Sammer • Hadoop Security, by Hortonworks Blog • HDFS Federation, by Suresh Srinivas • Hadoop 2.0 New Features, by VertiCloud Inc • MapReduce in Simple Terms, by Saliya Ekanayake • Hadoop Architecture, by Philippe Julio