Introduction to Hadoop
23 Feb 2012, Sydney

Mark Fei
mark.fei@cloudera.com




Copyright © 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.   201109   01-1
Agenda

During this talk, we’ll cover:
§  Why Hadoop is needed
§  The basic concepts of HDFS and MapReduce
§  What sort of problems can be solved with Hadoop
§  What other projects are included in the Hadoop ecosystem
§  What resources are available to assist in managing your Hadoop
    installation




Traditional Large-Scale Computation

§  Traditionally, computation has been processor-bound
      –  Relatively small amounts of data
      –  Significant amount of complex processing performed on that data
§  For decades, the primary push was to increase the computing
    power of a single machine
     –  Faster processor, more RAM




Distributed Systems

  §  Moore’s Law: roughly stated, processing power doubles every
      two years
  §  Even that hasn’t always proved adequate for very CPU-intensive
      jobs
  §  Distributed systems evolved to allow developers to use multiple
      machines for a single job
        –  MPI
        –  PVM
        –  Condor




MPI: Message Passing Interface
PVM: Parallel Virtual Machine
Data Becomes the Bottleneck

§  Getting the data to the processors becomes the bottleneck
§  Quick calculation
     –  Typical disk data transfer rate: 75 MB/sec
     –  Time taken to transfer 100 GB of data to the processor: approx 22
        minutes! (worked through in the sketch below)
          –  Assuming sustained reads
          –  Actual time will be worse, since most servers have less than
             100 GB of RAM available
§  A new approach is needed
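
To make the quick calculation above concrete, here is a minimal back-of-the-envelope sketch in Python (assuming 1 GB = 1,024 MB and the sustained 75 MB/sec rate quoted above):

    # Back-of-the-envelope transfer-time estimate (assumes 1 GB = 1,024 MB)
    disk_rate_mb_per_sec = 75          # typical sustained disk transfer rate
    data_size_mb = 100 * 1024          # 100 GB of input data

    transfer_seconds = data_size_mb / disk_rate_mb_per_sec
    print(f"{transfer_seconds:.0f} seconds = {transfer_seconds / 60:.1f} minutes")
    # ~1365 seconds, i.e. roughly 22.8 minutes, matching the 'approx 22 minutes' above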




Hadoop’s History

§  Hadoop is based on work done by Google in the early 2000s
     –  Specifically, on papers describing the Google File System (GFS)
        published in 2003, and MapReduce published in 2004
§  This work takes a radical new approach to the problem of
    distributed computing
      –  Meets all the requirements we have for reliability, scalability etc
§  Core concept: distribute the data as it is initially stored in the
    system
      –  Individual nodes can work on data local to those nodes
           –  No data transfer over the network is required for initial
              processing




Core Hadoop Concepts

§  Applications are written in high-level code
     –  Developers do not worry about network programming, temporal
        dependencies etc
§  Nodes talk to each other as little as possible
     –  Developers should not write code which communicates between
        nodes
     –  ‘Shared nothing’ architecture
§  Data is spread among machines in advance
     –  Computation happens where the data is stored, wherever
        possible
          –  Data is replicated multiple times on the system for increased
             availability and reliability


Hadoop Components

§  Hadoop consists of two core components
     –  The Hadoop Distributed File System (HDFS)
     –  The MapReduce software framework
§  There are many other projects based around core Hadoop
     –  Often referred to as the ‘Hadoop Ecosystem’
     –  Pig, Hive, HBase, Flume, Oozie, Sqoop, etc
         –  Many are discussed later in the course
§  A set of machines running HDFS and MapReduce is known as a
    Hadoop Cluster
      –  Individual machines are known as nodes
      –  A cluster can have as few as one node or as many as several
         thousand
           –  More nodes generally means better performance

Hadoop Components: HDFS

§  HDFS, the Hadoop Distributed File System, is responsible for
    storing data on the cluster
§  Data files are split into blocks and distributed across multiple
    nodes in the cluster
§  Each block is replicated multiple times
     –  Default is to replicate each block three times
     –  Replicas are stored on different nodes
     –  This ensures both reliability and availability
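
The block-and-replica idea can be illustrated with a small conceptual sketch (plain Python, not the actual HDFS implementation; the 64 MB block size, round-robin placement, and node names are illustrative assumptions):

    # Conceptual sketch of HDFS-style block splitting and replication.
    # Not the real HDFS code; block size, placement policy, and node names are illustrative.
    BLOCK_SIZE = 64 * 1024 * 1024      # 64 MB, a common default block size
    REPLICATION = 3                    # default replication factor
    nodes = ["node1", "node2", "node3", "node4", "node5"]

    def place_blocks(file_size_bytes):
        """Split a file into fixed-size blocks and assign each block to 3 distinct nodes."""
        num_blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
        placement = {}
        for b in range(num_blocks):
            # Simple round-robin placement onto distinct nodes (real HDFS is rack-aware)
            placement[b] = [nodes[(b + r) % len(nodes)] for r in range(REPLICATION)]
        return placement

    print(place_blocks(200 * 1024 * 1024))   # a 200 MB file -> 4 blocks, 3 replicas each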




Hadoop Components: MapReduce

§  MapReduce is the system used to process data in the Hadoop
    cluster
§  Consists of two phases: Map, and then Reduce
§  Each Map task operates on a discrete portion of the overall
    dataset
      –  Typically one HDFS data block
§  After all Maps are complete, the MapReduce system distributes
    the intermediate data to nodes which perform the Reduce phase




MapReduce Example: Word Count

§  Count the number of occurrences of each word in a large
    amount of input data

Map(input_key, input_value)
   foreach word w in input_value:
       emit(w, 1)


§  Input to the Mapper

(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

§  Output from the Mapper

('the', 1), ('cat', 1), ('sat', 1), ('on', 1),
('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1),
('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
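
The Map pseudocode above translates almost directly into a standalone Python sketch (an illustration of the Map step only, not the Hadoop API; the numeric keys are the byte offsets shown in the sample input):

    def map_word_count(input_key, input_value):
        """Emit (word, 1) for every word in the input line."""
        return [(word, 1) for word in input_value.split()]

    records = [(3414, 'the cat sat on the mat'),
               (3437, 'the aardvark sat on the sofa')]

    intermediate = []
    for key, value in records:
        intermediate.extend(map_word_count(key, value))
    print(intermediate)
    # [('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1), ...]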
Map Phase

Mapper Input

    The cat sat on the mat
    The aardvark sat on the sofa

Mapper Output

    The, 1   cat, 1        sat, 1   on, 1   the, 1   mat, 1
    The, 1   aardvark, 1   sat, 1   on, 1   the, 1   sofa, 1




MapReduce: The Reducer

§  After the Map phase is over, all the intermediate values for a
    given intermediate key are combined together into a list
§  This list is given to a Reducer
     –  There may be a single Reducer, or multiple Reducers
     –  All values associated with a particular intermediate key are
        guaranteed to go to the same Reducer
     –  The intermediate keys, and their value lists, are passed to the
        Reducer in sorted key order
     –  This step is known as the ‘shuffle and sort’
§  The Reducer outputs zero or more final key/value pairs
     –  These are written to HDFS
     –  In practice, the Reducer usually emits a single key/value pair for
        each input key
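
A minimal sketch of the 'shuffle and sort' idea, using the mapper output from the previous example (plain Python; in a real cluster Hadoop performs this grouping and distribution across nodes):

    from collections import defaultdict

    mapper_output = [('the', 1), ('cat', 1), ('sat', 1), ('on', 1),
                     ('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1),
                     ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)]

    # Group all values for each intermediate key, then visit keys in sorted order
    grouped = defaultdict(list)
    for key, value in mapper_output:
        grouped[key].append(value)

    for key in sorted(grouped):
        print(key, grouped[key])
    # aardvark [1]
    # cat [1]
    # ...
    # the [1, 1, 1, 1]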
Example Reducer: Sum Reducer

§  Add up all the values associated with each intermediate key:

 reduce(output_key, intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)

 Reducer Input            Reducer Output
 aardvark, 1              aardvark, 1
 cat, 1                   cat, 1
 mat, 1                   mat, 1
 on [1, 1]                on, 2
 sat [1, 1]               sat, 2
 sofa, 1                  sofa, 1
 the [1, 1, 1, 1]         the, 4

§  Reducer output:

 ('aardvark', 1)
 ('cat', 1)
 ('mat', 1)
 ('on', 2)
 ('sat', 2)
 ('sofa', 1)
 ('the', 4)
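
The sum Reducer above, rendered as a minimal standalone Python sketch (again illustrating the idea rather than the Hadoop API):

    def reduce_sum(output_key, intermediate_vals):
        """Sum the list of values for one intermediate key."""
        count = 0
        for v in intermediate_vals:
            count += v
        return (output_key, count)

    print(reduce_sum('the', [1, 1, 1, 1]))   # ('the', 4)
    print(reduce_sum('on',  [1, 1]))         # ('on', 2)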


Putting It All Together


The overall word count process:

 Mapper Input:
     The cat sat on the mat
     The aardvark sat on the sofa

 Mapping:
     The, 1   cat, 1        sat, 1   on, 1   the, 1   mat, 1
     The, 1   aardvark, 1   sat, 1   on, 1   the, 1   sofa, 1

 Shuffling:
     aardvark, 1   cat, 1   mat, 1   on [1, 1]   sat [1, 1]   sofa, 1   the [1, 1, 1, 1]

 Reducing:
     aardvark, 1   cat, 1   mat, 1   on, 2   sat, 2   sofa, 1   the, 4

 Final Result:
     aardvark, 1   cat, 1   mat, 1   on, 2   sat, 2   sofa, 1   the, 4
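
Putting the pieces together, a self-contained Python sketch of the whole pipeline (words are lowercased so 'The' and 'the' are counted together, matching the final result above; in a real cluster the map and reduce steps run in parallel across many nodes):

    from collections import defaultdict

    def map_word_count(line):
        # Lowercase so 'The' and 'the' count as the same word
        return [(word, 1) for word in line.lower().split()]

    def reduce_sum(key, values):
        return (key, sum(values))

    lines = ["The cat sat on the mat", "The aardvark sat on the sofa"]

    # Map
    intermediate = [pair for line in lines for pair in map_word_count(line)]

    # Shuffle and sort
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce
    final = [reduce_sum(key, grouped[key]) for key in sorted(grouped)]
    print(final)
    # [('aardvark', 1), ('cat', 1), ('mat', 1), ('on', 2), ('sat', 2), ('sofa', 1), ('the', 4)]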




Why Do We Care About Counting Words?

§  Word count is challenging over massive amounts of data
     –  Using a single compute node would be too time-consuming
     –  Using distributed nodes requires moving data
     –  The number of unique words can easily exceed available RAM
         –  Would need a hash table on disk
     –  Would need to partition the results (sort and shuffle)
§  Many fundamental statistics are simple aggregate functions
§  Most aggregation functions are distributive in nature
     –  e.g., max, min, sum, count (see the sketch below)
§  MapReduce breaks complex tasks down into smaller elements
    which can be executed in parallel
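
For example, a distributive function such as max can be computed independently on each partition of the data and then combined, which is exactly the structure MapReduce exploits (a small illustrative sketch):

    # max is distributive: the max of per-partition maxima equals the global max
    partitions = [[3, 17, 8], [42, 5], [11, 29, 30]]

    partial_maxima = [max(p) for p in partitions]   # computed independently, e.g. by mappers
    global_max = max(partial_maxima)                # combined in a single reduce step

    assert global_max == max(x for p in partitions for x in p)
    print(partial_maxima, global_max)               # [17, 42, 30] 42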



What is Common Across Hadoop-able Problems?
§  Nature of the data
     –  Complex data
     –  Multiple data sources
     –  Lots of it




§  Nature of the analysis
     –  Batch processing
     –  Parallel execution
     –  Spread data over a cluster of servers and take the computation
        to the data

Where Does Data Come From?

§  Science
     –  Medical imaging, sensor data, genome sequencing, weather
        data, satellite feeds, etc.
§  Industry
      –  Financial, pharmaceutical, manufacturing, insurance, online,
         energy, retail data
§  Legacy
      –  Sales data, customer behavior, product databases, accounting
         data, etc.
§  System Data
     –  Log files, health & status feeds, activity streams, network
        messages, Web Analytics, intrusion detection, spam filters


What Analysis is Possible With Hadoop?

§  Text mining
§  Index building
§  Graph creation and analysis
§  Pattern recognition
§  Collaborative filtering
§  Prediction models
§  Sentiment analysis
§  Risk assessment




Benefits of Analyzing With Hadoop

§  Previously impossible/impractical to do this analysis
§  Analysis conducted at lower cost
§  Analysis conducted in less time
§  Greater flexibility
§  Linear scalability




Facts about Apache Hadoop

§  Open source (using the Apache license)
§  Around 40 core Hadoop committers from ~10 companies
     –  Cloudera, Yahoo!, Facebook, Apple and more
§  Hundreds of contributors writing features, fixing bugs
§  Many related projects, applications, tools, etc.




A large ecosystem




Who uses Hadoop?




Vendor integration




About Cloudera

§  Cloudera is “The commercial Hadoop company”
§  Founded by leading experts on Hadoop from Facebook, Google,
    Oracle and Yahoo
§  Provides consulting and training services for Hadoop users
§  Staff includes several committers to Hadoop projects




Cloudera Software (All Open-Source)

§  Cloudera’s Distribution including Apache Hadoop (CDH)
      –  A single, easy-to-install package from the Apache Hadoop core
         repository
      –  Includes a stable version of Hadoop, plus critical bug fixes and
         solid new features from the development version
§  Components
     –  Apache Hadoop
     –  Apache Hive
     –  Apache Pig
     –  Apache HBase
     –  Apache Zookeeper
     –  Flume, Hue, Oozie, and Sqoop



A Coherent Platform




[Diagram: the CDH components working together – Hue and the Hue SDK, Oozie, Hive, Pig, Flume, Sqoop, HBase, and ZooKeeper]




Cloudera Software (Licensed)

§  Cloudera Enterprise
      –  Cloudera’s Distribution including Apache Hadoop (CDH)
      –  Management applications
      –  Production support
§  Management Apps
     –  Authorization Management and Provisioning
     –  Resource Management
     –  Integration, Configuration, and Monitoring




Conclusion

§  Apache Hadoop is a fast-growing framework for large-scale data storage and processing
§  Cloudera’s Distribution including Apache Hadoop offers a free,
    cohesive platform that encapsulates:
      –  Data integration
      –  Data processing
      –  Workflow scheduling
      –  Monitoring




