
Introduction to Hadoop - ACCU2010

  1. MapReduce with Apache Hadoop
     Analysing Big Data
     April 2010
     Gavin Heavyside
     gavin.heavyside@journeydynamics.com
  2. About Journey Dynamics
     • Founded in 2006 to develop software technology to address the issues of congestion, fuel efficiency, driving safety and eco-driving
     • Based in the Surrey Technology Centre, Guildford, UK
     • Analyse large amounts (TB) of GPS data from cars, vans & trucks
     • TrafficSpeedsEQ® - accurate traffic speed forecasts by hour of day and day of week for every link in the road network
     • MyDrive® - unique & sophisticated system that learns how drivers behave: drivers can improve fuel economy, insurance companies can understand driver risk, navigation devices can improve route choice & ETA, fleet managers can monitor their fleet to improve safety & eco-driving
     © 2010 Journey Dynamics Ltd
  3. Big Data
     • Data volumes increasing
     • NYSE: 1TB new trade data/day
     • Google: processes 20PB/day (Sep 2007) http://tcrn.ch/agYjEL
     • LHC: 15PB data/year
     • Facebook: several TB photos uploaded/day
  4. “Medium” Data
     • Most of us arenʼt at Google or Facebook scale
     • But: data at the GB/TB scale is becoming more common
     • Outgrow conventional databases
     • Disks are cheap, but slow
     • 1TB drive - £50
     • 2.5 hours to read 1TB at 100MB/s
  5. Two Challenges
     • Managing lots of data
     • Doing something useful with it
  6. Managing Lots of Data
     • Access and analyse any or all of your data
     • SAN technologies (FC, iSCSI, NFS)
     • Querying (MySQL, PostgreSQL, Oracle)
     ➡ Cost, network bandwidth, concurrent access, resilience
     ➡ When you have 1000s of nodes, MTBF < 1 day
  7. Analysing Lots of Data
     • Parallel processing
     • HPC
     • Grid Computing
     • MPI
     • Sharding
     ➡ Too big for memory, specialised HW, complex, scalability
     ➡ Hardware reliability in large clusters
  8. Apache Hadoop
     • Reliable, scalable distributed computing platform
     • HDFS - high-throughput, fault-tolerant distributed file system
     • MapReduce - fault-tolerant distributed processing
     • Runs on commodity hardware
     • Cost-effective
     • Open source (Apache License)
  9. Hadoop History
     • 2003-2004: Google publishes MapReduce & GFS papers
     • 2004: Doug Cutting adds DFS & MapReduce to Nutch
     • 2006: Cutting joins Yahoo!, Hadoop moves out of Nutch
     • Jan 2008: top-level Apache project
     • April 2010: 95 companies on PoweredBy Hadoop wiki
     • Yahoo!, Twitter, Facebook, Microsoft, New York Times, LinkedIn, Last.fm, IBM, Baidu, Adobe
     “The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term.” - Doug Cutting
 10. Hadoop Ecosystem
     • HDFS
     • MapReduce
     • HBase
     • ZooKeeper
     • Pig
     • Hive
     • Chukwa
     • Avro
 11. Anatomy of a Hadoop Cluster
     [Diagram: a single Namenode and JobTracker coordinate racks 1..n of worker machines, each machine running a Datanode and a Tasktracker]
 12. HDFS
     • Reliable shared storage
     • Modelled after GFS
     • Very large files
     • Streaming data access
     • Commodity hardware
     • Replication
     • Tolerate regular hardware failure
 13. HDFS
     • Block size 64MB
     • Default replication factor = 3
     [Diagram: a file split into blocks 1-5, with each block replicated on three datanodes]
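To make the slide's defaults concrete, here is a back-of-envelope sketch of how many blocks a file becomes and how much raw storage its replicas consume (the 1GB file size is a made-up example):

```ruby
# Back-of-envelope HDFS storage maths using the slide's defaults.
BLOCK_SIZE_MB = 64   # HDFS default block size (at the time)
REPLICATION   = 3    # default replication factor

file_size_mb = 1_000 # hypothetical ~1GB file

blocks = (file_size_mb.to_f / BLOCK_SIZE_MB).ceil # blocks the file is split into
stored = file_size_mb * REPLICATION               # raw MB written cluster-wide

puts "#{blocks} blocks, #{stored}MB of raw storage"
# → 16 blocks, 3000MB of raw storage
```

The 3x overhead is the price of tolerating the regular hardware failures the previous slide mentions.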
 15. MapReduce
     • Based on 2004 Google paper
     • Concepts from Functional Programming
     • Used for lots of things within Google (and now everywhere)
     • Parallel Map => Shuffle & Sort => Parallel Reduce
     • Easy to understand and write MapReduce programs
     • Move the computation to the data
     • Rack-aware
     • Linear scalability
     • Works with HDFS, S3, KFS, file:// and more
 16. MapReduce
     • “Single Threaded” MapReduce: cat input/* | map | sort | reduce > output
     • Map program parses the input and emits [key,value] pairs
     • Sort by key
     • Reduce computes output from values with same key
     • Extrapolate to PB of data on thousands of nodes
 17. MapReduce - Distributed Example
     [Diagram: input splits 0..n in HDFS are mapped and sorted in parallel; the sorted map output is copied and merged to the reducers, which write output parts 0 and 1 back to HDFS]
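The copy step in the diagram has to decide which reducer receives each intermediate key; the usual default is a hash partitioner. A minimal sketch of the idea (the method name `partition_for` is illustrative, not a Hadoop API):

```ruby
require 'zlib'

NUM_REDUCERS = 4

# Map each intermediate key to a reducer index, as a default hash
# partitioner does. CRC32 is used instead of Ruby's per-process-seeded
# #hash so the assignment is stable across runs.
def partition_for(key)
  Zlib.crc32(key.to_s) % NUM_REDUCERS
end

# All values for a given key land on the same reducer:
p partition_for("apple") == partition_for("apple")  # → true
```

Because every occurrence of a key hashes to the same reducer, each reducer sees the complete set of values for its keys.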
 18. MapReduce can be good for:
     • “Embarrassingly parallel” problems
     • Semi-structured or unstructured data
     • Index generation
     • Log analysis
     • Statistical analysis of patterns in data
     • Image processing
     • Generating map tiles
     • Data mining
     • Much, much more
 19. MapReduce is not good for:
     • Real-time or low-latency queries
     • Some graph algorithms
     • Algorithms that canʼt be split into independent chunks
     • Some types of joins*
     • Not a replacement for RDBMS
     * Can be tricky to write unless you use an abstraction, e.g. Pig, Hive
 20. Writing MapReduce Programs
     • Java
     • Pipes (C++, sockets)
     • Streaming
     • Frameworks, e.g. wukong (ruby), dumbo (python)
     • JVM languages, e.g. JRuby, Clojure, Scala
     • Cascading.org
     • Cascalog
     • Pig
     • Hive
 21. Streaming Example (ruby)
     • mapper.rb
     • reducer.rb
     [Slide shows code listings for mapper.rb and reducer.rb]
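The slide's listings are not in the transcript. A typical streaming word count along the lines the deck describes (map emits "word\t1" pairs, sort groups them, reduce sums runs of the same key) might look like this sketch, here simulated in-process:

```ruby
# Word count in the Hadoop Streaming style, simulated in one process.
# map: one "word\t1" line per word; reduce: sum runs of the same
# (sorted) key - mirroring cat input | mapper | sort | reducer.

def map_lines(lines)
  lines.flat_map { |line| line.split.map { |w| "#{w.downcase}\t1" } }
end

def reduce_lines(sorted_lines)
  out = []
  current, count = nil, 0
  sorted_lines.each do |line|
    word, n = line.split("\t")
    if word == current
      count += n.to_i
    else
      out << "#{current}\t#{count}" if current
      current, count = word, n.to_i
    end
  end
  out << "#{current}\t#{count}" if current
  out
end

input = ["the quick brown fox", "the lazy dog"]
p reduce_lines(map_lines(input).sort)
# → ["brown\t1", "dog\t1", "fox\t1", "lazy\t1", "quick\t1", "the\t2"]
```

On a real cluster the two phases would be separate scripts reading STDIN, submitted via the hadoop-streaming jar; the in-process `sort` here stands in for the shuffle & sort phase.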
 22. Pig
     • High-level language for writing data analysis programs
     • Runs MapReduce jobs
     • Joins, grouping, filtering, sorting, statistical functions
     • User-defined functions
     • Optional schemas
     • Sampling
     • Pig Latin similar to an imperative language: define the steps to run
 23. Pig Example
     [Slide shows a Pig Latin script]
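The script on the slide is not in the transcript. An illustrative Pig Latin word count in the step-by-step style the previous slide describes (the 'input' and 'output' paths are made up):

```
lines   = LOAD 'input' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'output';
```

Each statement names an intermediate relation; Pig compiles the whole sequence into one or more MapReduce jobs.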
 24. Hive
     • Data warehousing and querying
     • HiveQL - SQL-like language for querying data
     • Runs MapReduce jobs
     • Joins, grouping, filtering, sorting, statistical functions
     • Partitioning of data
     • User-defined functions
     • Sampling
     • Declarative syntax
 25. Hive Example
     [Slide shows HiveQL statements]
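The statements on the slide are not in the transcript; the speaker notes say it covered create table, load data, and select/join/group. A hedged HiveQL sketch of those three steps (table name, columns and path are invented for illustration):

```
-- Illustrative only: table, columns and path are made up.
CREATE TABLE trips (driver_id INT, speed_mph DOUBLE)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

LOAD DATA INPATH '/data/trips.tsv' INTO TABLE trips;

SELECT driver_id, AVG(speed_mph)
FROM trips
GROUP BY driver_id;
```

Unlike the Pig script, this is declarative: you state the result you want and Hive's planner generates the MapReduce steps.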
 26. Getting Started
     • http://hadoop.apache.org
     • Cloudera Distribution (VM, source, rpm, deb)
     • Elastic MapReduce
     • Cloudera VM
     • Pseudo-distributed cluster
 27. Learn More
     • http://hadoop.apache.org
     • Books
     • Mailing lists
     • Commercial support & training, e.g. Cloudera
 28. Related
     • Cassandra 0.6 has Hadoop integration - run MapReduce jobs against data in Cassandra
     • NoSQL DBs with MapReduce functionality include CouchDB, MongoDB, Riak and more
     • RDBMS with MapReduce include Aster, Greenplum, HadoopDB and more
 29. Gavin Heavyside
     gavin.heavyside@journeydynamics.com
     www.journeydynamics.com

Editor's Notes



  • Andrei Alexandrescu - 1300ish photos/second at facebook


  • Use only subset/sample of data
    A server might have a MTBF of a few years
    With thousands you can expect failures at least daily
    System needs to handle HW failures

  • Explain what we mean by commodity hardware
    8-core, 16-24GB ram, 4x1TB disks, gig ethernet

  • 95 Companies not exhaustive!
  • Zookeeper - distributed synchronisation service
    Locking, race conditions. Used for HBase distributed column-oriented database
    Chukwa is a large-scale log collection & analysis tool
    Avro - like protocol buffers - move towards default IPC mechanism in Hadoop
  • Mention secondary namenode
    Namenode is point of failure - not highly available
    Mention lots of disks - JBOD: donʼt stripe, donʼt RAID
    Rack-aware - splits will be processed local to the HDFS blocks where possible.
    Pseudo distributed cluster can run all services on single machine
    DN/TT on lots of machines, NN, SNN, JT depend on cluster size.
  • Optimised for large (GB) files
    64MB block size not efficient for lots of small files -> concatenate them



  • Text data
    Binary data (protocol buffers, avro)
    Custom input/output formats


  • Compare with previous example, splits on each node, shuffle to reduce nodes
    Talk about partitioning
    Reducers have to wait until the map and sort/shuffle has finished
    Combiners can run on Tasktracker
    Can be single Reducer or many

  • Mention Speculative Execution
    Forward index: list of words per document
    Inverted Index: list of documents containing word
    Once you “get” it, you start to see how lots of problems can be parallelised
    Map Tiles at Google - Tech Talk on YouTube
  • Algorithm has a strict data dependency between iterations? Might not be efficient to parallelise with MapReduce
  • JVM languages can access the Hadoop Java classes
    Streaming works in any language that can read/write stdin/stdout
    Cascalog brand new; only heard about it last night
  • Implements a map-side hash join
  • SAMPLE command samples random parts of data
    SPLIT
    AVG, COUNT, CONCAT, MIN, MAX, SUM
    Custom input/output formats

  • Hive developed at Facebook, contributed back to community
    SQL is a declarative language, you specify an outcome, and the query planner generates a sequence of steps.
    Ad-hoc queries using basic SQL
    Custom input/output formats
  • Create Table
    Load Data
    Select/Join/Group

  • Cloudera do developer training and sysadmin training
    Cloudera do commercial support
    Other providers on Hadoop homepage

  • Cassandra is an alternative to HBase - seems to have more momentum behind it.
    Plenty of hobbyist/small-scale MapReduce frameworks out there
