Hadoop Inside


         TC Data Platform Office, GFIS Team
                    Eun-Jo Lee
What is Hadoop
 Hadoop is a Framework & System for
    parallel processing of
    large amounts of data in
    a distributed computing environment
           http://searchbusinessintelligence.techtarget.in/tutorial/Apache-Hadoop-FAQ-for-BI-professionals




 Apache project
    open source
    Java-based
    clone of Google's systems
        GFS -> HDFS
        MapReduce -> MapReduce
Distributed Processing System
 How to process data in distributed environment
    how to read/write data
    how to control nodes
    load balancing
 Monitoring
    node status
    task status
 Fault tolerance
    error detection
        process error, network error, hardware error, …
    error handling
        temporary error: retry -> duplication, data corruption, …
        permanent error: fail over (which one?)
        process hang: timeout & retry
            • too long -> long response time
            • too short -> infinite loop
Hadoop System Architecture

HDFS + MapReduce

[Figure: Job Tracker, Name Node, and Secondary Name Node drawn above three nodes, each of which pairs a Task Tracker with a Data Node. Legend: Node, Process, Heart Beat, Data Read/Write.]
HDFS
 vs. Filesystem
    inode – namespace
    cylinder / track – data node
    blocks(bytes) – blocks(Mbytes)
 Features
    very large files
    write once, read many times
    support for usual file system operations
         ls, cp, mv, rm, chmod, chown, put, cat, …
    no support for multiple writers or arbitrary modifications
Block Replication & Rack Awareness


[Figure: files split into blocks 1-4, with copies of each block spread over several servers and racks. Legend: File, Block, Server, Rack.]
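The placement idea behind the figure can be sketched as below. This is a simplified model, not HDFS code: it assumes the commonly documented default policy (first replica on the writer's server, second on a server in a different rack, third on another server in that same remote rack). The function names, the `topology` dictionary, and the server names are hypothetical.

```python
import random

def rack_of(topology, server):
    """Return the rack that holds the given server."""
    for rack, servers in topology.items():
        if server in servers:
            return rack
    raise KeyError(server)

def place_replicas(writer, topology, rng=None):
    """Pick three target servers for one block (simplified policy).

    topology maps a rack name to the list of servers in that rack.
    """
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    local_rack = rack_of(topology, writer)
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second = rng.choice(topology[remote_rack])
    third = rng.choice([s for s in topology[remote_rack] if s != second])
    return [writer, second, third]

topology = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1", "b2"],
    "rack-c": ["c1", "c2"],
}
targets = place_replicas("a1", topology)
print(targets, [rack_of(topology, s) for s in targets])
```

Losing one whole rack still leaves at least one copy of the block, while two of the three copies share a rack so that only one transfer crosses racks.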
HDFS - Read

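Data Read
    1. Read Request (Client -> Name Node)
    2. Response (Name Node -> Client)
    3. Request Data (Client -> Data Node)
    4. Read Data (Data Node -> Client)

[Figure: a Client, a Name Node, and three Data Nodes. Legend: Node, Data Block, Data I/O, Operation Message.]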
HDFS - Write

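Data Write
    1. Write Request (Client -> Name Node)
    2. Response (Name Node -> Client)
    3. Write Data (Client -> first Data Node)
    4. Write Replica (Data Node -> next Data Node, twice along the chain)
    5. Write Done (Data Node -> Client)

[Figure: a Client, a Name Node, and three Data Nodes. Legend: Node, Data Block, Data I/O, Operation Message.]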
HDFS – Write (Failure)

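Data Write
    1. Write Request (Client -> Name Node)
    2. Response (Name Node -> Client)
    3. Write Data (Client -> Data Node)
    4. Write Replica (drawn only once here, unlike the two replica writes of the normal case)
    5. Write Done (Data Node -> Client)

[Figure: a Client, a Name Node, and three Data Nodes. Legend: Node, Data Block, Data I/O, Operation Message.]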
HDFS – Write (Failure)

Data Write
    Replica Arrangement (Name Node)
    Delete Partial block (Data Node)
    Write Replica (Data Node -> Data Node)

[Figure: a Client, a Name Node, and four Data Nodes. Legend: Node, Data Block, Data I/O, Operation Message.]
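The failure case above can be modelled with a toy sketch. It is not the HDFS protocol: it only assumes that a failed pipeline node is dropped, that its partial block must be deleted, and that the name node later picks extra nodes to restore the replica count. All function and node names are hypothetical.

```python
def pipeline_write(pipeline, failed=frozenset()):
    """Split the pipeline into nodes that stored the block and nodes that failed."""
    stored, dropped = [], []
    for node in pipeline:
        if node in failed:
            dropped.append(node)  # holds a partial block that must be deleted
        else:
            stored.append(node)
    return stored, dropped

def arrange_replicas(stored, spare_nodes, replication=3):
    """Name node side: choose extra nodes to get back to the target replica count."""
    missing = max(replication - len(stored), 0)
    candidates = [n for n in spare_nodes if n not in stored]
    return candidates[:missing]

stored, dropped = pipeline_write(["dn1", "dn2", "dn3"], failed={"dn2"})
print("stored on:", stored)
print("delete partial block on:", dropped)
print("write replica to:", arrange_replicas(stored, ["dn4", "dn5"]))
```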
MapReduce
 Definition
    map: (+1) [ 1, 2, 3, 4, …, 10 ] -> [ 2, 3, 4, 5, …, 11 ]
    reduce: (+) [ 2, 3, 4, 5, …, 11 ] -> 65
 Programming Model for processing data sets in Hadoop
    projection, filter -> map task
    aggregation, join -> reduce task
    sort -> partitioning
 Job Tracker & Task Trackers
    master / slave
    job = many tasks
        # of map tasks = # of file splits (default: # of blocks)
        # of reduce tasks = user configuration
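The map and reduce definitions above can be checked with plain Python built-ins. This is a single-process illustration of the two functions only, not of Hadoop's distributed execution.

```python
from functools import reduce
from operator import add

numbers = list(range(1, 11))                  # [1, 2, ..., 10]
mapped = list(map(lambda x: x + 1, numbers))  # map: (+1) applied to every element
total = reduce(add, mapped)                   # reduce: (+) folds the list to one value

print(mapped)  # [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
print(total)   # 65
```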
MapReduce
Map / Reduce Task




[Figure: map / reduce data flow. Legend: Distributed File System, Split, Input Data Record, Map Task, Reduce Task, Shuffling & Sorting, Map Output Record (Key/Value pair), Reduce Output Record (Key/Value pair), Partition.]
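The flow named in the legend (split, map task, partition, shuffling & sorting, reduce task) can be sketched in one process as a word count. This is an illustration under assumptions: the word-count job, the byte-sum partition function, and all names are hypothetical and are not taken from Hadoop.

```python
from collections import defaultdict

def map_task(split):
    """Emit one (word, 1) output record per word in each input record."""
    for record in split:
        for word in record.split():
            yield word, 1

def partition(key, num_reducers):
    """Deterministic toy partitioner: byte sum of the key modulo reducer count."""
    return sum(key.encode("utf-8")) % num_reducers

def reduce_task(key, values):
    return key, sum(values)

def run_job(splits, num_reducers=2):
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:                      # one map task per split
        for key, value in map_task(split):
            partitions[partition(key, num_reducers)][key].append(value)
    output = []
    for part in partitions:                   # one reduce task per partition
        for key in sorted(part):              # keys reach the reducer in sorted order
            output.append(reduce_task(key, part[key]))
    return output

splits = [["a b a"], ["b c"]]
result = run_job(splits)
print(result)  # [('b', 2), ('a', 2), ('c', 1)]
```

Output is sorted within each partition, not across partitions, which is why `b` precedes `a` here.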
Mapper - partitioning
    double indexed structure

      Output Buffer (default: 100 MB):  key, value, key, value, …, key, value
      1st Index:  (partition, key offset, value offset), (partition, key offset, value offset), …
      2nd Index:  key offset, key offset, key offset, …


    Spill Thread
       data sorting: 2nd index (quick sort)
       spill file generating
            spill data file & index file
       flush
            merge sort (by key) per partition
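Why two indexes help can be seen in a small sketch: records stay where they were written in the buffer, and only the second index is reordered. This is a simplified model with hypothetical names. It uses Python's built-in sort rather than quick sort, and here the second index holds positions into the first index.

```python
def partition(key, num_partitions):
    """Toy partitioner: byte sum of the key modulo the partition count."""
    return sum(key) % num_partitions

def build_indexes(records, num_partitions):
    buffer = bytearray()
    first_index = []                      # (partition, key offset, value offset)
    for key, value in records:
        key_off = len(buffer)
        buffer += key
        val_off = len(buffer)
        buffer += value
        first_index.append((partition(key, num_partitions), key_off, val_off))
    second_index = list(range(len(first_index)))
    return buffer, first_index, second_index

def key_of(buffer, first_index, i):
    _, key_off, val_off = first_index[i]
    return bytes(buffer[key_off:val_off])

def value_of(buffer, first_index, i):
    _, _, val_off = first_index[i]
    end = first_index[i + 1][1] if i + 1 < len(first_index) else len(buffer)
    return bytes(buffer[val_off:end])

def sort_second_index(buffer, first_index, second_index):
    """Order by (partition, key); the buffer and first index are left untouched."""
    second_index.sort(key=lambda i: (first_index[i][0], key_of(buffer, first_index, i)))

records = [(b"pear", b"1"), (b"apple", b"2"), (b"fig", b"3")]
buffer, first_index, second_index = build_indexes(records, num_partitions=3)
sort_second_index(buffer, first_index, second_index)
for i in second_index:
    print(first_index[i][0], key_of(buffer, first_index, i), value_of(buffer, first_index, i))
```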
Reducer - fetching
 GetMapEventsThread
    map event listener
 MapOutputCopier
    data fetching from completed mapper (HTTP)
    runs concurrently in several threads
 Merger
    key sorting (heap sort)

[Figure: three TaskTrackers (map task) send completion events to the Job Tracker, which passes completion events on to the TaskTracker (reduce task); its Copiers fetch map output with HTTP GET and feed the Reducer.]
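The Merger step can be sketched with a heap-based merge of map outputs that are already sorted by key. This uses the standard library's `heapq.merge` as a stand-in and is not Hadoop's Merger. The segment contents are made up for illustration.

```python
import heapq

def merge_map_outputs(segments):
    """Merge sorted (key, value) segments into one list sorted by key."""
    return list(heapq.merge(*segments, key=lambda kv: kv[0]))

# One sorted segment per completed map task (hypothetical data).
segments = [
    [("a", 1), ("c", 1)],
    [("b", 1), ("c", 2)],
    [("a", 3)],
]
merged = merge_map_outputs(segments)
print([key for key, _ in merged])
```

The heap holds only one head record per segment, so the merge can stream over inputs larger than memory.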
Job Flow
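    Client Node
        1. runJob (MapReduce Program -> Job Client)
        2. copy job resources (Job Client -> Shared File System)
        3. submit job (Job Client -> Job Tracker)
    JobTracker Node
        4. retrieve input splits (Job Tracker <- Shared File System)
        5. add job (Job Tracker -> job queue)
        6. heartbeat (Task Tracker -> Job Tracker)
        7. assign task (Job Tracker -> Task Tracker)
    TaskTracker Node
        8. retrieve job resources (Task Tracker <- Shared File System)
        9. launch (Task Tracker -> Child)
        10. run (Child -> Map/Reduce Task)
        11. read data / write result (Map/Reduce Task <-> Shared File System)

[Figure legend: Node, JVM, Class, Job Queue, Method Call, I/O, Job, Task.]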
Monitoring
 Heart beat
    task tracker status checking
    task request / assignment
    other commands (restart, shutdown, kill task, …)
 Cluster Status
 Job / Task Status
    JobInProgress
    TaskInProgress
 Reporter & Metrics
 Black list
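Heartbeat-based status checking and a black list can be sketched as follows. This is an illustrative model only: the class, the expiry window, and the failure threshold are hypothetical and are not Hadoop's actual values or classes.

```python
class HeartbeatMonitor:
    """Track the last heartbeat and failed-task count of each task tracker."""

    def __init__(self, expiry_seconds, max_failures):
        self.expiry_seconds = expiry_seconds  # assumed silence window
        self.max_failures = max_failures      # assumed black-list threshold
        self.last_seen = {}
        self.failures = {}

    def heartbeat(self, tracker, now, failed_tasks=0):
        """Record a heartbeat and any task failures it reports."""
        self.last_seen[tracker] = now
        self.failures[tracker] = self.failures.get(tracker, 0) + failed_tasks

    def lost_trackers(self, now):
        """Trackers that have been silent for longer than the expiry window."""
        return sorted(t for t, seen in self.last_seen.items()
                      if now - seen > self.expiry_seconds)

    def blacklisted(self):
        """Trackers whose reported failures reached the threshold."""
        return sorted(t for t, n in self.failures.items() if n >= self.max_failures)

monitor = HeartbeatMonitor(expiry_seconds=10, max_failures=4)
monitor.heartbeat("tt1", now=0)
monitor.heartbeat("tt2", now=0, failed_tasks=4)
monitor.heartbeat("tt1", now=8)
print(monitor.lost_trackers(now=15))  # ['tt2']
print(monitor.blacklisted())          # ['tt2']
```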
Monitoring (Cluster Info)
Monitoring (Job Info)
Monitoring (Task Info)
Task Scheduler
 job queue
    red-black tree (java.util.TreeMap)
    sort by priority & job id (request time)
 load factor
    remaining tasks / capacity
 task assignment
    high priority
    new task > speculative execution task > dummy splits task
    map task (local) > map task (non-local) > reduce task
 padding
    padding = MIN(total tasks * pad fraction, task capacity)
    for speculative execution
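The queue ordering, load factor, and padding formula above can be worked through in a short sketch. A sorted list stands in for java.util.TreeMap. The `Job` type, the convention that a lower priority number means higher priority, and the sample numbers are assumptions for illustration.

```python
from collections import namedtuple

Job = namedtuple("Job", ["priority", "job_id"])  # hypothetical job record

def order_jobs(jobs):
    """Sort by priority first, then by job id (earlier request first)."""
    return sorted(jobs, key=lambda job: (job.priority, job.job_id))

def load_factor(remaining_tasks, capacity):
    """Remaining tasks divided by the cluster's task capacity."""
    return remaining_tasks / capacity

def padding(total_tasks, pad_fraction, task_capacity):
    """Slots held back for speculative execution: MIN(total * fraction, capacity)."""
    return min(int(total_tasks * pad_fraction), task_capacity)

queue = order_jobs([Job(2, 7), Job(1, 9), Job(1, 3)])
print(queue)
print(load_factor(30, 40))
print(padding(200, 0.25, 8), padding(20, 0.25, 8))
```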
Error Handling
 Retry
    configurable (default 4 times)
 Timeout
    configurable
 Speculative Execution
    current – start >= 1 minute
    average progress – progress > 20%
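The retry rule and the two speculative-execution conditions can be written as small functions. This is a sketch with hypothetical names. It reads "default 4 times" as four attempts in total, which is an assumption, and it expresses progress as a fraction between 0 and 1.

```python
def run_with_retry(task, max_attempts=4):
    """Call task() until it succeeds or the attempts are used up."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception as error:  # a real system would catch narrower errors
            last_error = error
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error

def should_speculate(now, start_time, progress, average_progress,
                     min_runtime=60.0, progress_gap=0.20):
    """True when the task has run >= 1 minute and lags the average by > 20%."""
    ran_long_enough = (now - start_time) >= min_runtime
    lagging = (average_progress - progress) > progress_gap
    return ran_long_enough and lagging

print(should_speculate(now=100, start_time=0, progress=0.3, average_progress=0.6))
print(should_speculate(now=30, start_time=0, progress=0.3, average_progress=0.6))
```

The one-minute floor keeps newly started tasks from being duplicated before their progress is meaningful.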
Distributed Processing System
 How to process data in distributed environment
    how to read/write data
    how to control nodes
    load balancing
    -> Hadoop: HDFS Client, master / slave, replication / rack awareness, job scheduler
 Monitoring
    node status
    task status
 Fault tolerance
    error detection
        process error, network error, hardware error, …
    error handling
        temporary error: retry -> duplication, data corruption, …
        permanent error: fail over (which one?)
        process hang: timeout & retry
            • too long -> long response time
            • too short -> infinite loop
Distributed Processing System
 How to process data in distributed environment
    how to read/write data
    how to control nodes
    load balancing
 Monitoring
    node status
    task status
    -> Hadoop: heart beat, job/task status, reporter / metrics
 Fault tolerance
    error detection
        process error, network error, hardware error, …
    error handling
        temporary error: retry -> duplication, data corruption, …
        permanent error: fail over (which one?)
        process hang: timeout & retry
            • too long -> long response time
            • too short -> infinite loop
Distributed Processing System
 How to process data in distributed environment
    how to read/write data
    how to control nodes
    load balancing
 Monitoring
    node status
    task status
 Fault tolerance
    error detection
        process error, network error, hardware error, …
    error handling
        temporary error: retry -> duplication, data corruption, …
        permanent error: fail over (which one?)
        process hang: timeout & retry
            • too long -> long response time
            • too short -> infinite loop
    -> Hadoop: black list, timeout & retry, speculative execution
Limitations
 map -> reduce network overhead
    iterative processing
    full (or theta) join
 data made of many small splits
 Not suited to low latency
    polling & pulling
    job initialization
    optimized for throughput
        job scheduling
        data access
Q&A

JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 

Hadoop Inside

  • 1. Hadoop Inside
      TC Data Platform Office, GFIS Team
      Lee Eun-jo (이은조)
  • 2. What is Hadoop
      Hadoop is a framework and system for parallel processing of large amounts of data in a distributed computing environment
          http://searchbusinessintelligence.techtarget.in/tutorial/Apache-Hadoop-FAQ-for-BI-professionals
      Apache project
          open source
          Java based
          clone of Google systems
              GFS -> HDFS
              MapReduce -> MapReduce
  • 3. Distributed Processing System
      How to process data in a distributed environment
          how to read/write data
          how to control nodes
          load balancing
      Monitoring
          node status
          task status
      Fault tolerance
          error detection
              process error, network error, hardware error, …
          error handling
              temporary error: retry -> duplication, data corruption, …
              permanent error: fail over (which one?)
              process hang: timeout & retry
                  too long -> long response time
                  too short -> infinite loop
  • 4. Hadoop System Architecture
      HDFS + MapReduce
      [Figure: Job Tracker, Name Node, and Secondary Name Node at the top; three nodes below, each running a Task Tracker and a Data Node. Legend: Node, Process, Heart Beat, Data Read/Write]
  • 5. HDFS
      vs. Filesystem
          inode – namespace
          cylinder / track – data node
          blocks (bytes) – blocks (Mbytes)
      Features
          very large files
          write once, read many times
          support for usual file system operations
              ls, cp, mv, rm, chmod, chown, put, cat, …
          no support for multiple writers or arbitrary modifications
  • 6. Block Replication & Rack Awareness
      [Figure: blocks 1 to 4 of a file, each replicated on several servers spread across racks. Legend: File, Block, Server, Rack]
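The slide shows replicas spread across racks but does not state the placement rule. As a minimal sketch, assuming the commonly documented HDFS default for a replication factor of 3 (first replica on the writer's node, second on a node in a different rack, third on another node in that same remote rack), placement could look like the following. The function name `place_replicas`, the `topology` dictionary, and the node names are hypothetical, chosen only for illustration.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three nodes for one block (illustrative, not Hadoop's actual code).

    topology maps a rack name to the list of node names in that rack.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    # First replica: the node that is writing the block.
    first = writer
    # Second and third replicas: two distinct nodes on one remote rack.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != rack_of[writer] and len(nodes) >= 2
    ]
    remote = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote], 2)
    return [first, second, third]


# Hypothetical three-rack cluster.
topology = {
    "rack1": ["n1", "n2"],
    "rack2": ["n3", "n4"],
    "rack3": ["n5", "n6"],
}
replicas = place_replicas("n1", topology, random.Random(0))
```

With this rule a single rack failure can never remove all three copies of a block, while only one replica write has to cross racks.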
  • 7. HDFS - Read
      [Figure: Client, Name Node, and three Data Nodes. Legend: Node, Data Block, Data I/O, Operation Message]
      1. Read Request (Client -> Name Node)
      2. Response (Name Node -> Client)
      3. Request (Client -> Data Node)
      4. Read Data (Data Node -> Client)
  • 8. HDFS - Write
      [Figure: Client, Name Node, and three Data Nodes. Legend: Node, Data Block, Data I/O, Operation Message]
      1. Write Request (Client -> Name Node)
      2. Response (Name Node -> Client)
      3. Write Data (Client -> first Data Node)
      4. Write Replica (to each of the other two Data Nodes)
      5. Write Done
  • 9. HDFS – Write (Failure)
      [Figure: same layout as the write slide, but only one "4. Write Replica" arrow is shown. Legend: Node, Data Block, Data I/O, Operation Message]
      1. Write Request
      2. Response
      3. Write Data
      4. Write Replica
      5. Write Done
  • 10. HDFS – Write (Failure)
      [Figure: Client, Name Node, and three Data Nodes. Arrow labels: Replica Arrangement, Delete Partial block, Write Replica. Legend: Node, Data Block, Data I/O, Operation Message]
  • 11. MapReduce
      Definition
          map: (+1) [ 1, 2, 3, 4, …, 10 ] -> [ 2, 3, 4, 5, …, 11 ]
          reduce: (+) [ 2, 3, 4, 5, …, 11 ] -> 65
      Programming model for processing data sets in Hadoop
          projection, filter -> map task
          aggregation, join -> reduce task
          sort -> partitioning
      Job Tracker & Task Trackers
          master / slave
          job = many tasks
          # of map tasks = # of file splits (default: # of blocks)
          # of reduce tasks = user configuration
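The functional definition on this slide can be checked directly. The short sketch below uses Python's built-in `map` and `functools.reduce` as stand-ins for the two operations; it illustrates the definition only, not Hadoop's API.

```python
from functools import reduce
from operator import add

numbers = list(range(1, 11))                   # [1, 2, ..., 10]

# map: apply (+1) to every element independently.
mapped = list(map(lambda x: x + 1, numbers))   # [2, 3, ..., 11]

# reduce: fold the mapped list with (+) into a single value.
total = reduce(add, mapped)                    # 65

print(mapped, total)
```

Because each `map` call depends only on its own element, the map step can be split across machines; only the reduce step needs to bring values together.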
  • 12. MapReduce - Map / Reduce Task
      [Figure legend: Distributed File System, Split, Input Data Record, Map Task, Map Output Record (key/value pair), Partition, Shuffling & Sorting, Reduce Task, Reduce Output Record (key/value pair)]
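The stages named in the figure legend (split, map task, partition, shuffling and sorting, reduce task) can be walked through with a single-process word count. This is a minimal sketch under stated assumptions: the function names, the byte-sum partitioner, and the sample input are all hypothetical, and everything that Hadoop does across nodes is done here in plain in-memory loops.

```python
from collections import defaultdict


def map_task(split):
    """Emit one (word, 1) output record per word in the split."""
    for line in split:
        for word in line.split():
            yield word, 1


def partition(key, num_reducers):
    """Deterministic stand-in for a hash partitioner."""
    return sum(key.encode("utf-8")) % num_reducers


def run_job(splits, num_reducers=2):
    # Shuffle: route every map output record to its reducer's partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for split in splits:
        for key, value in map_task(split):
            partitions[partition(key, num_reducers)][key].append(value)
    # Sort by key inside each partition, then reduce each key's values.
    output = []
    for part in partitions:
        for key in sorted(part):
            output.append((key, sum(part[key])))
    return output


splits = [["a b a"], ["b c"]]   # two input splits, one map task each
result = run_job(splits)
print(result)
```

All records with the same key land in the same partition, which is what lets each reduce task work without seeing the other reducers' data.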
  • 19. Mapper - partitioning
      double-indexed structure
          output buffer (default: 100 MB): key, value, key, value, …
          1st index: partition, key offset, value offset per record
          2nd index: key offsets
      spill thread
          data sorting: 2nd index (quick sort)
          spill file generation
              spill data file & index file
      flush
          merge sort (by key) per partition
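The spill and flush steps can be sketched as "sort each buffer by (partition, key), then merge the sorted spills". The example below is a simplification under stated assumptions: it sorts whole records rather than an index of offsets, it uses Python's built-in sort instead of quick sort, and the names `spill`, `merge_spills`, and the byte-sum partitioner are hypothetical.

```python
import heapq


def spill(buffer, num_partitions):
    """Tag records with a partition and sort by (partition, key), like one spill file."""
    tagged = [(sum(k.encode("utf-8")) % num_partitions, k, v) for k, v in buffer]
    return sorted(tagged, key=lambda rec: (rec[0], rec[1]))


def merge_spills(spills):
    """Merge already-sorted spills so each partition stays contiguous and key-ordered."""
    return list(heapq.merge(*spills, key=lambda rec: (rec[0], rec[1])))


spill1 = spill([("c", 1), ("a", 1), ("b", 1)], num_partitions=2)
spill2 = spill([("b", 1), ("a", 1)], num_partitions=2)
merged = merge_spills([spill1, spill2])
print(merged)
```

Since every spill is already sorted, the final merge only has to compare the heads of the spills, which keeps the flush step cheap even when many spill files exist.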
  • 20. Reducer - fetching
      GetMapEventsThread
          map event listener
      MapOutputCopier
          fetches data from completed mappers (HTTP)
          runs concurrently in several threads
      Merger
          key sorting (heap sort)
      [Figure: completion events pass through the Job Tracker to the TaskTracker running the reduce task; its Copiers fetch map output from the TaskTrackers running map tasks via HTTP GET]
  • 21. Job Flow
      [Figure: Client Node (MapReduce Program, Job Client), JobTracker Node (Job Tracker), TaskTracker Node (Task Tracker, Child, Map/Reduce Task), Shared File System. Legend: Node, Job Queue, Job, JVM, Method Call, Task, Class, I/O]
      1. runJob
      2. copy job resources
      3. submit job
      4. retrieve input splits
      5. add job
      6. heartbeat
      7. assign task
      8. retrieve job resources
      9. launch
      10. run
      11. read data / write result
  • 22. Monitoring
      Heart beat
          task tracker status checking
          task request / assignment
          other commands (restart, shutdown, kill task, …)
      Cluster Status
      Job / Task Status
          JobInProgress
          TaskInProgress
      Reporter & Metrics
      Black list
  • 23. Monitoring (Summary)
      Heart beat
          task tracker status checking
          task request / assignment
          other commands (restart, shutdown, kill task, …)
      Cluster Status
      Job / Task Status
          JobInProgress
          TaskInProgress
      Reporter & Metrics
      Black list
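Heartbeat-based status checking and a black list can be sketched as a small bookkeeping class. Everything here is an assumption made for illustration: the class name `TrackerMonitor`, the expiry interval, and the failure threshold are hypothetical and are not taken from the slides or from Hadoop's source.

```python
class TrackerMonitor:
    """Tracks last heartbeat time and failure count per task tracker (illustrative only)."""

    def __init__(self, expiry_seconds=600.0, max_failures=4):
        # Both thresholds are assumed values, not Hadoop defaults from the slides.
        self.expiry_seconds = expiry_seconds
        self.max_failures = max_failures
        self.last_seen = {}
        self.failures = {}

    def heartbeat(self, tracker, now):
        """Record that a tracker reported in at time `now` (seconds)."""
        self.last_seen[tracker] = now

    def report_failure(self, tracker):
        """Count one failed task against a tracker."""
        self.failures[tracker] = self.failures.get(tracker, 0) + 1

    def lost_trackers(self, now):
        """Trackers whose last heartbeat is older than the expiry interval."""
        return sorted(t for t, seen in self.last_seen.items()
                      if now - seen > self.expiry_seconds)

    def blacklisted(self):
        """Trackers with too many failed tasks; a scheduler would skip them."""
        return sorted(t for t, n in self.failures.items() if n >= self.max_failures)


monitor = TrackerMonitor(expiry_seconds=10.0, max_failures=2)
monitor.heartbeat("tt1", now=0.0)
monitor.heartbeat("tt2", now=8.0)
monitor.report_failure("tt2")
monitor.report_failure("tt2")
lost = monitor.lost_trackers(now=15.0)
bad = monitor.blacklisted()
```

The two checks are independent: a tracker can be alive (recent heartbeat) and still be black-listed because its tasks keep failing.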
  • 27. Task Scheduler
      job queue
          red-black tree (java.util.TreeMap)
          sorted by priority & job id (request time)
      load factor
          remaining tasks / capacity
      task assignment
          high priority
              new task > speculative execution task > dummy splits task
              map task (local) > map task (non-local) > reduce task
          padding
              padding = MIN(total tasks * pad fraction, task capacity)
              for speculative execution
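The three formulas on this slide (queue ordering, load factor, padding) are simple enough to write out. The sketch below is a hedged illustration: the dictionary fields, the convention that a smaller priority number means higher priority, and the rounding of the padding to an integer are assumptions, not details given in the slides.

```python
def job_order(jobs):
    """Order jobs by priority, then job id (a stand-in for request time).

    Assumes a smaller priority number means a higher priority.
    """
    return sorted(jobs, key=lambda job: (job["priority"], job["job_id"]))


def load_factor(remaining_tasks, capacity):
    """load factor = remaining tasks / capacity."""
    return remaining_tasks / capacity if capacity else 0.0


def padding(total_tasks, pad_fraction, task_capacity):
    """padding = MIN(total tasks * pad fraction, task capacity), kept as whole slots."""
    return min(int(total_tasks * pad_fraction), task_capacity)


jobs = [
    {"job_id": 3, "priority": 2},
    {"job_id": 1, "priority": 2},
    {"job_id": 2, "priority": 1},
]
ordered_ids = [job["job_id"] for job in job_order(jobs)]
print(ordered_ids, load_factor(30, 40), padding(1000, 0.01, 4))
```

The padding reserves a few slots so that speculative execution tasks still have somewhere to run when the cluster is nearly full.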
  • 28. Error Handling
      Retry
          configurable (default: 4 times)
      Timeout
          configurable
      Speculative Execution
          current - start >= 1 minute
          average progress - progress > 20%
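The retry limit and the two speculative-execution conditions translate into short predicates. This is a minimal sketch: the function names and parameters are hypothetical, times are in seconds, progress values are fractions between 0 and 1, and the defaults simply mirror the numbers on the slide.

```python
def should_speculate(start, now, progress, average_progress,
                     min_runtime=60.0, lag_threshold=0.20):
    """True when a task has run at least a minute and lags the average by more than 20%."""
    ran_long_enough = (now - start) >= min_runtime
    is_lagging = (average_progress - progress) > lag_threshold
    return ran_long_enough and is_lagging


def run_with_retry(task, max_attempts=4):
    """Call task() until it succeeds or max_attempts is reached; re-raise the last error."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(), attempt
        except Exception:
            if attempt == max_attempts:
                raise


# Hypothetical flaky task: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary error")
    return "done"

outcome = run_with_retry(flaky)
```

Requiring a minimum runtime before speculating avoids launching duplicate tasks for work that has only just started and has no meaningful progress figure yet.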
  • 29. Distributed Processing System
      How to process data in a distributed environment
          how to read/write data
          how to control nodes
          load balancing
      Monitoring
          node status
          task status
      Fault tolerance
          error detection
              process error, network error, hardware error, …
          error handling
              temporary error: retry -> duplication, data corruption, …
              permanent error: fail over (which one?)
              process hang: timeout & retry
                  too long -> long response time
                  too short -> infinite loop
  • 30. Distributed Processing System
      (same outline as slide 29)
      Callout over "How to process data in a distributed environment": HDFS Client, master / slave, replication / rack awareness, job scheduler
  • 31. Distributed Processing System
      (same outline as slide 29)
      Callout over "Monitoring": heart beat, job/task status, reporter / metrics
  • 32. Distributed Processing System
      (same outline as slide 29)
      Callout for "Fault tolerance": black list, timeout & retry, speculative execution
  • 33. Limitations
      map -> reduce network overhead
          iterative processing
          full (or theta) join
          small size but many splits data
      Low latency
          polling & pulling
          job initializing
          optimized for throughput
              job scheduling
              data access
  • 34. Q&A