MapReduce with Apache Hadoop
Analysing Big Data


ACCU 2010, April 2010
Gavin Heavyside
gavin.heavyside@journeydynamics.com
About Journey Dynamics
    •   Founded in 2006 to develop software technology to address the issues of
        congestion, fuel efficiency, driving safety and eco-driving
    •   Based in the Surrey Technology Centre, Guildford, UK
    •   Analyse large amounts (TB) of GPS data from cars, vans & trucks
    •   TrafficSpeedsEQ® - Accurate traffic speed forecasts by hour of day and day of week
        for every link in the road network
    •   MyDrive® - Unique & sophisticated system that learns how drivers behave
         - Drivers can improve fuel economy
         - Insurance companies can understand driver risk
         - Navigation devices can improve route choice & ETA
         - Fleet managers can monitor their fleet to improve safety & eco-driving




2
Big Data

    • Data volumes increasing
    • NYSE: 1TB new trade data/day
    • Google: Processes 20PB/day (Sep 2007) http://tcrn.ch/agYjEL
    • LHC: 15PB data/year
    • Facebook: several TB photos uploaded/day




3
“Medium” Data
    • Most of us arenʼt at Google or Facebook scale
    • But: data at the GB/TB scale is becoming more common
    • Outgrow conventional databases
    • Disks are cheap, but slow




                        • 1TB drive - £50
                        • 2.5 hours to read 1TB at 100MB/s
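
                        A quick back-of-the-envelope check (added for context, not on the
                        original slide), treating 1TB as 10^12 bytes:

                            10^12 B / 10^8 B/s = 10^4 s ≈ 2.8 hours

                        i.e. around two and a half to three hours just to scan one cheap disk
                        sequentially.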




4
Two Challenges

    • Managing lots of data

    • Doing something useful with it




5
Managing Lots of Data

    • Access and analyse any or all of your data
    • SAN technologies (FC, iSCSI, NFS)
    • Querying (MySQL, PostgreSQL, Oracle)



    ➡ Cost, network bandwidth, concurrent access, resilience

    ➡ When you have 1000s of nodes, MTBF < 1 day




6
Analysing Lots of Data
    • Parallel processing
    • HPC
    • Grid Computing
    • MPI
    • Sharding

    ➡ Too big for memory, specialised HW, complex, scalability
    ➡ Hardware reliability in large clusters




7
Apache Hadoop
    • Reliable, scalable distributed computing platform

    • HDFS - high throughput fault-tolerant distributed file system
    • MapReduce - fault-tolerant distributed processing

    • Runs on commodity hardware
    • Cost-effective

    • Open source (Apache License)




8
Hadoop History
    • 2003-2004: Google publishes the MapReduce & GFS papers
    • 2004: Doug Cutting adds a DFS & MapReduce to Nutch
    • 2006: Cutting joins Yahoo!, Hadoop moves out of Nutch
    • Jan 2008: becomes a top-level Apache project
    • April 2010: 95 companies on the PoweredBy Hadoop wiki
    • Yahoo!, Twitter, Facebook, Microsoft, New York Times,
      LinkedIn, Last.fm, IBM, Baidu, Adobe
            "The name my kid gave a stuffed yellow elephant.
            Short, relatively easy to spell and pronounce,
            meaningless, and not used elsewhere: those are my
            naming criteria. Kids are good at generating such.
            Googol is a kid's term"
            Doug Cutting


9
Hadoop Ecosystem
     • HDFS
     • MapReduce

     • HBase
     • ZooKeeper
     • Pig
     • Hive

     • Chukwa
     • Avro




10
Anatomy of a Hadoop Cluster


         [Diagram: cluster layout. Master services: Namenode (HDFS) and
         JobTracker (MapReduce). Worker nodes, grouped into Rack 1 ... Rack n,
         each run a Datanode (HDFS storage) and a Tasktracker (MapReduce tasks).]
11
HDFS
     • Reliable shared storage
     • Modelled after GFS
     • Very large files
     • Streaming data access
     • Commodity Hardware
     • Replication
     • Tolerate regular hardware failure




12
HDFS
     • Block size 64MB
     • Default replication factor = 3


              [Diagram: a file is split into blocks 1-5; HDFS stores each block on
              three different datanodes, so every block has three replicas spread
              across the cluster.]
13
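
     As an aside (not from the original deck): block placement and replication can be
     inspected from the Hadoop command line. A minimal sketch, assuming a 0.20-era
     installation; the file and path names are hypothetical, but the commands are
     standard HDFS shell and fsck:

        hadoop fs -put trips.csv /data/trips.csv               # copy a local file into HDFS
        hadoop fs -setrep -w 3 /data/trips.csv                 # (re)set its replication factor to 3
        hadoop fsck /data/trips.csv -files -blocks -locations  # list its blocks and the datanodes holding each replica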
MapReduce
     • Based on 2004 Google paper
     • Concepts from Functional Programming
     • Used for lots of things within Google (and now everywhere)
     • Parallel Map => Shuffle & Sort => Parallel Reduce
     • Easy to understand and write MapReduce programs
     • Move the computation to the data
     • Rack-aware
     • Linear Scalability
     • Works with HDFS, S3, KFS, file:// and more




14
MapReduce
     • “Single Threaded” MapReduce:
             cat input/* | map | sort | reduce > output

     • Map program parses the input and emits [key,value] pairs
     • Sort by key
     • Reduce computes output from values with same key

                 [Diagram: Map -> Sort -> Reduce data flow]




     • Extrapolate to PB of data on thousands of nodes


15
MapReduce
     • Distributed Example

       [Diagram: each input split stored on HDFS is read by a Map task; map output
       is sorted locally, then copied (shuffled) to the Reduce tasks and merged;
       each Reduce task writes one output partition (part 0, part 1, ...) back to HDFS.]
16
MapReduce can be good for:
     • “Embarrassingly Parallel” problems
     • Semi-structured or unstructured data
     • Index generation
     • Log analysis
     • Statistical analysis of patterns in data
     • Image processing
     • Generating map tiles
     • Data Mining
     • Much, much more




17
MapReduce is not good for:
     • Real-time or low-latency queries
     • Some graph algorithms
     • Algorithms that canʼt be split into independent chunks
     • Some types of joins*
     • Not a replacement for RDBMS




     * Can be tricky to write unless you use an abstraction e.g. Pig, Hive




18
Writing MapReduce Programs
     • Java
     • Pipes (C++, sockets)
     • Streaming
     • Frameworks, e.g. wukong (Ruby), dumbo (Python)
     • JVM languages, e.g. JRuby, Clojure, Scala
     • Cascading.org
     • Cascalog
     • Pig
     • Hive




19
Streaming Example (ruby)
     • mapper.rb
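
        (The code for mapper.rb was an image on the original slide and is not preserved;
        the sketch below is a hypothetical word-count-style mapper, added to illustrate
        the streaming model: read lines on stdin, emit tab-separated key/value pairs on stdout.)

        #!/usr/bin/env ruby
        # Hypothetical streaming mapper: emit one "word <TAB> 1" pair per word.
        STDIN.each_line do |line|
          line.split.each do |word|
            puts "#{word.downcase}\t1"
          end
        end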




     • reducer.rb
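
        (Likewise hypothetical; the streaming framework delivers the mapper output to the
        reducer sorted by key, so the reducer can sum consecutive counts for each word.)

        #!/usr/bin/env ruby
        # Hypothetical streaming reducer: input is sorted by key, so totals can be
        # accumulated across consecutive lines with the same word.
        current_word, count = nil, 0
        STDIN.each_line do |line|
          word, value = line.chomp.split("\t")
          if word == current_word
            count += value.to_i
          else
            puts "#{current_word}\t#{count}" if current_word
            current_word, count = word, value.to_i
          end
        end
        puts "#{current_word}\t#{count}" if current_word

        Such scripts can be tested locally with the pipeline from the earlier slide, and
        then submitted with the Hadoop Streaming jar (the jar path below is typical for
        0.20-era distributions but varies; input/output paths are hypothetical):

        cat input/* | ruby mapper.rb | sort | ruby reducer.rb > output

        hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
          -input /data/input -output /data/output \
          -mapper mapper.rb -reducer reducer.rb \
          -file mapper.rb -file reducer.rb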




20
Pig
     • High level language for writing data analysis programs
     • Runs MapReduce jobs
     • Joins, grouping, filtering, sorting, statistical functions
     • User-defined functions
     • Optional schemas
     • Sampling
     • Pig Latin is an imperative-style language: you define the steps to run




21
Pig Example
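
      (The original example was shown as an image and is not preserved here. Below is a
      hypothetical Pig Latin sketch in the spirit of the talk - loading GPS-style records,
      grouping, and computing aggregates; all file paths and field names are invented.)

      -- Hypothetical sketch: average observed speed per road link and hour.
      trips      = LOAD '/data/trips.csv' USING PigStorage(',')
                       AS (link_id:chararray, hr:int, speed:double);
      by_link    = GROUP trips BY (link_id, hr);
      avg_speeds = FOREACH by_link GENERATE
                       FLATTEN(group) AS (link_id, hr),
                       AVG(trips.speed) AS avg_speed,
                       COUNT(trips)     AS observations;
      busy       = FILTER avg_speeds BY observations > 10;
      ordered    = ORDER busy BY link_id, hr;
      STORE ordered INTO '/data/avg_speeds';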




22
Hive
     • Data warehousing and querying
     • HiveQL - SQL-like language for querying data
     • Runs MapReduce jobs
     • Joins, grouping, filtering, sorting, statistical functions
     • Partitioning of data
     • User-defined functions
     • Sampling
     • Declarative syntax




23
Hive Example
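
      (Again, the original example is not preserved. The editor's notes suggest it showed
      create table, load data, and a select with join/group; the hypothetical HiveQL below
      follows that shape, with invented table and column names.)

      -- Hypothetical sketch: create a table, load data, then join and aggregate.
      CREATE TABLE trips (link_id STRING, hr INT, speed DOUBLE)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

      LOAD DATA INPATH '/data/trips.csv' INTO TABLE trips;

      -- assumes a second table roads(link_id, road_name) already exists
      SELECT r.road_name, t.hr, AVG(t.speed)
      FROM trips t JOIN roads r ON (t.link_id = r.link_id)
      GROUP BY r.road_name, t.hr;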




24
Getting Started
     • http://hadoop.apache.org
     • Cloudera Distribution (VM, source, rpm, deb)
     • Amazon Elastic MapReduce (hosted Hadoop on AWS)

     • Cloudera VM
     • Pseudo-distributed cluster
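
      (Added sketch, not on the original slide: a typical first run on a pseudo-distributed
      cluster using 0.20-era commands; the examples jar name and local "docs" directory are
      assumptions and vary by version/distribution.)

      hadoop namenode -format                       # initialise a new HDFS namespace
      $HADOOP_HOME/bin/start-all.sh                 # start namenode, datanode, jobtracker, tasktracker
      hadoop fs -put docs /input                    # copy some local files into HDFS
      hadoop jar hadoop-*-examples.jar wordcount /input /output
      hadoop fs -cat /output/part-*                 # inspect the result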




25
Learn More
     • http://hadoop.apache.org
     • Books, e.g. Tom White's "Hadoop: The Definitive Guide" (O'Reilly)




     • Mailing Lists
     • Commercial Support & Training, e.g. Cloudera




26
Related
     • Cassandra 0.6 has Hadoop integration: run MapReduce jobs against data in Cassandra
     • NoSQL DBs with MapReduce functionality include CouchDB, MongoDB, Riak and more
     • RDBMS with MapReduce include Aster, Greenplum, HadoopDB and more




27
Gavin Heavyside
     gavin.heavyside@journeydynamics.com
          www.journeydynamics.com




                                           © 2010 Journey Dynamics Ltd
28


Editor's Notes

  • #4 Andrei Alexandrescu: roughly 1300 photos/second uploaded at Facebook.
  • #7 Could you use only a subset/sample of the data? A single server might have an MTBF of a few years, but with thousands of servers you can expect failures at least daily, so the system needs to handle hardware failures.
  • #9 Explain what we mean by commodity hardware: e.g. 8-core, 16-24GB RAM, 4x1TB disks, gigabit Ethernet.
  • #10 The 95 companies listed are not exhaustive!
  • #11 ZooKeeper is a distributed synchronisation service (locking, race conditions), used by HBase, the distributed column-oriented database. Chukwa is a large-scale log collection & analysis tool. Avro is like Protocol Buffers, and is moving towards being the default IPC mechanism in Hadoop.
  • #12 Mention the secondary namenode. The namenode is a single point of failure - not highly available. Mention lots of disks - JBOD: don't stripe, don't RAID. Rack-aware: splits will be processed local to the HDFS blocks where possible. A pseudo-distributed cluster can run all services on a single machine; datanodes/tasktrackers run on lots of machines, while namenode, secondary namenode and jobtracker placement depends on cluster size.
  • #13 Optimised for large (GB) files; the 64MB block size is not efficient for lots of small files -> concatenate them.
  • #15 Text data; binary data (Protocol Buffers, Avro); custom input/output formats.
  • #17 Compare with the previous example: splits on each node, shuffle to the reduce nodes. Talk about partitioning. Reducers have to wait until the map and sort/shuffle have finished. Combiners can run on the tasktracker. There can be a single reducer or many.
  • #18 Mention speculative execution. Forward index: list of words per document. Inverted index: list of documents containing a word. Once you "get" it, you start to see how lots of problems can be parallelised. Map tiles at Google - Tech Talk on YouTube.
  • #19 If an algorithm has a strict data dependency between iterations, it might not be efficient to parallelise with MapReduce.
  • #20 JVM languages can access the Hadoop Java classes. Streaming works in any language that can read/write stdin/stdout. Cascalog is brand new; only heard about it last night.
  • #21 Implements a map-side hash join.
  • #22 The SAMPLE command samples random parts of the data; SPLIT; AVG, COUNT, CONCAT, MIN, MAX, SUM; custom input/output formats.
  • #24 Hive was developed at Facebook and contributed back to the community. SQL is a declarative language: you specify an outcome, and the query planner generates a sequence of steps. Ad-hoc queries using basic SQL; custom input/output formats.
  • #25 Create table; load data; select/join/group.
  • #27 Cloudera do developer training and sysadmin training, and offer commercial support. Other providers are listed on the Hadoop homepage.
  • #28 Cassandra is an alternative to HBase and seems to have more momentum behind it. There are plenty of hobbyist/small-scale MapReduce frameworks out there.