Your SlideShare is downloading. ×
Introduction to Hadoop - ACCU2010
Upcoming SlideShare
Loading in...5
×

Thanks for flagging this SlideShare!

Oops! An error has occurred.

×

Saving this for later?

Get the SlideShare app to save on your phone or tablet. Read anywhere, anytime - even offline.

Text the download link to your phone

Standard text messaging rates apply

Introduction to Hadoop - ACCU2010

3,885
views

Published on

Slides from my Introduction to Hadoop session at the ACCU2010 conference

Slides from my Introduction to Hadoop session at the ACCU2010 conference

Published in: Technology

0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
3,885
On Slideshare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
222
Comments
0
Likes
1
Embeds 0
No embeds

Report content
Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
No notes for slide


  • Andrei Alexandrescu - 1300ish photos/second at facebook


  • Use only subset/sample of data
    A server might have a MTBF of a few years
    With thousands you can expect failures at least daily
    System needs to be handle HW failures

  • Explain what we mean by commodity hardware
    8-core, 16-24GB ram, 4x1TB disks, gig ethernet

  • 95 Companies not exhaustive!
  • Zookeeper - distributed synchronisation service
    Locking, race conditions. Used for HBase distributed column-oriented database
    Chukwa is large-scale log collection & analysis tools
    Avro - like protocol buffers - move towards default IPC mechanism in Hadoop
  • Mention secondary namenode
    Namenode is point of failure - not highly available
    Mention lots of disks - JBOD: don’t stripe, don’t RAID
    Rack-aware - splits will be processed local to the HDFS blocks where possible.
    Pseudo distributed cluster can run all services on single machine
    DN/TT on lots of machines, NN, SNN, JT depend on cluster size.
  • Optimised for large (GB) files
    64MB block size not efficient for lots of small files -> concatenate them



  • Text data
    Binary data (protocol buffers, avro)
    Custom input/output formats


  • Compare with previous example, splits on each node, shuffle to reduce nodes
    Talk about partitioning
    Reducers have to wait until the map and sort/shuffle has finished
    Combiners can run on Tasktracker
    Can be single Reducer or many

  • Mention Speculative Execution
    Forward index: list of words per document
    Inverted Index: list of documents containing word
    Once you “get” it, you start to see how lots of problems can be parallelised
    Map Tiles at Google - Tech Talk on YouTube
  • Algorithm has a strict data dependency between iterations? Might not be efficient to parallelise with MapReduce
  • JVM languages can access the Hadoop Java classes
    Streaming works in any language that can read/write stdin/stdout
    Cascalog brand new; only heard about it last night
  • Implements a map-side hash join
  • SAMPLE command samples random parts of data
    SPLIT
    AVG, COUNT, CONCAT, MIN, MAX, SUM
    Custom input/output formats

  • Hive developed at Facebook, contributed back to community
    SQL is a declarative language, you specify an outcome, and the query planner generates a sequence of steps.
    Ad-hoc queries using basic SQL
    Custom input/output formats
  • Create Table
    Load Data
    Select/Join/Group

  • Cloudera do developer training and sysadmin training
    Cloudera do commercial support
    Other providers on Hadoop homepage

  • Cassandra is an alternative to HBase - seems to have more momentum behind it.
    Plenty of hobbyist/small-scale MapReduce frameworks out there

  • Transcript

    • 1. MapReduce with Apache Hadoop Analysing Big Data April 2010 Gavin Heavyside gavin.heavyside@journeydynamics.com
    • 2. About Journey Dynamics • Founded in 2006 to develop software technology to address the issues of congestion, fuel efficiency, driving safety and eco-driving • Based in the Surrey Technology Centre, Guildford, UK • Analyse large amounts (TB) of GPS data from cars, vans & trucks • TrafficSpeedsEQ® - Accurate traffic speed forecasts by hour of day and day of week for every link in the road network • MyDrive® - Unique & sophisticated system that learns how drivers behave Drivers can improve fuel economy Insurance companies can understand driver risk Navigation devices can improved route choice & ETA Fleet managers can monitor their fleet to improve safety & eco-driving © 2010 Journey Dynamics Ltd 2
    • 3. Big Data • Data volumes increasing • NYSE: 1TB new trade data/day • Google: Processes 20PB/day (Sep 2007) http://tcrn.ch/agYjEL • LHC: 15PB data/year • Facebook: several TB photos uploaded/day © 2010 Journey Dynamics Ltd 3
    • 4. “Medium” Data • Most of us arenʼt at Google or Facebook scale • But: data at the GB/TB scale is becoming more common • Outgrow conventional databases • Disks are cheap, but slow • 1TB drive - £50 • 2.5 hours to read 1TB at 100MB/s © 2010 Journey Dynamics Ltd 4
    • 5. Two Challenges • Managing lots of data • Doing something useful with it © 2010 Journey Dynamics Ltd 5
    • 6. Managing Lots of Data • Access and analyse any or all of your data • SAN technologies (FC, iSCSI, NFS) • Querying (MySQL, PostgreSQL, Oracle) ➡ Cost, network bandwidth, concurrent access, resilience ➡ When you have 1000s of nodes, MTBF < 1 day © 2010 Journey Dynamics Ltd 6
    • 7. Analysing Lots of Data • Parallel processing • HPC • Grid Computing • MPI • Sharding ➡ Too big for memory, specialised HW, complex, scalability ➡ Hardware reliability in large clusters © 2010 Journey Dynamics Ltd 7
    • 8. Apache Hadoop • Reliable, scalable distributed computing platform • HDFS - high throughput fault-tolerant distributed file system • MapReduce - fault-tolerant distributed processing • Runs on commodity hardware • Cost-effective • Open source (Apache License) © 2010 Journey Dynamics Ltd 8
    • 9. Hadoop History • 2003-2004 Google publishes MapReduce & GFS papers • 2004 Doug Cutting add DFS & MapReduce to Nutch • 2006 Cutting joins Yahoo!, Hadoop moves out of Nutch • Jan 2008 - top level Apache project • April 2010: 95 companies on PoweredBy Hadoop wiki • Yahoo!, Twitter, Facebook, Microsoft, New York Times, LinkedIn, Last.fm, IBM, Baidu, Adobe "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such. Googol is a kid's term" Doug Cutting © 2010 Journey Dynamics Ltd 9
    • 10. Hadoop Ecosystem • HDFS • MapReduce • HBase • ZooKeeper • Pig • Hive • Chukwa • Avro © 2010 Journey Dynamics Ltd 10
    • 11. Anatomy of a Hadoop Cluster Namenode Tasktracker Tasktracker Tasktracker Tasktracker Datanode Rack 1 Datanode Datanode Datanode JobTracker Tasktracker Tasktracker Tasktracker Tasktracker Datanode Datanode Rack n Datanode Datanode © 2010 Journey Dynamics Ltd 11
    • 12. HDFS • Reliable shared storage • Modelled after GFS • Very large files • Streaming data access • Commodity Hardware • Replication • Tolerate regular hardware failure © 2010 Journey Dynamics Ltd 12
    • 13. HDFS • Block size 64MB • Default replication factor = 3 1 2 HDFS 1 2 3 4 5 3 2 3 4 5 1 4 3 4 5 1 2 5 © 2010 Journey Dynamics Ltd 13
    • 14. HDFS • Block size 64MB • Default replication factor = 3 1 2 HDFS 1 2 3 4 5 3 2 3 4 5 1 4 3 4 5 1 2 5 © 2010 Journey Dynamics Ltd 13
    • 15. MapReduce • Based on 2004 Google paper • Concepts from Functional Programming • Used for lots of things within Google (and now everywhere) • Parallel Map => Shuffle & Sort => Parallel Reduce • Easy to understand and write MapReduce programs • Move the computation to the data • Rack-aware • Linear Scalability • Works with HDFS, S3, KFS, file:// and more © 2010 Journey Dynamics Ltd 14
    • 16. MapReduce • “Single Threaded” MapReduce: cat input/* | map | sort | reduce > output • Map program parses the input and emits [key,value] pairs • Sort by key • Reduce computes output from values with same key Reduce Map Sort • Extrapolate to PB of data on thousands of nodes © 2010 Journey Dynamics Ltd 15
    • 17. MapReduce • Distributed Example sort Split 0 HDFS Map copy merge sort Split 1 part 0 HDFS Map Reduce HDFS merge part 1 Reduce HDFS sort Split n HDFS Map © 2010 Journey Dynamics Ltd 16
    • 18. MapReduce can be good for: • “Embarrassingly Parallel” problems • Semi-structured or unstructured data • Index generation • Log analysis • Statistical analysis of patterns in data • Image processing • Generating map tiles • Data Mining • Much, much more © 2010 Journey Dynamics Ltd 17
    • 19. MapReduce is not be good for: • Real-time or low-latency queries • Some graph algorithms • Algorithms that canʼt be split into independent chunks • Some types of joins* • Not a replacement for RDBMS * Can be tricky to write unless you use an abstraction e.g. Pig, Hive © 2010 Journey Dynamics Ltd 18
    • 20. Writing MapReduce Programs • Java • Pipes (C++, sockets) • Streaming • Frameworks, e.g. wukong(ruby), dumbo(python) • JVM languages e.g. JRuby, Clojure, Scala • Cascading.org • Cascalog • Pig • Hive © 2010 Journey Dynamics Ltd 19
    • 21. Streaming Example (ruby) • mapper.rb • reducer.rb © 2010 Journey Dynamics Ltd 20
    • 22. Pig • High level language for writing data analysis programs • Runs MapReduce jobs • Joins, grouping, filtering, sorting, statistical functions • User-defined functions • Optional schemas • Sampling • Pig Latin similar to imperative language, define steps to run © 2010 Journey Dynamics Ltd 21
    • 23. Pig Example © 2010 Journey Dynamics Ltd 22
    • 24. Hive • Data warehousing and querying • HiveQL - SQL-like language for querying data • Runs MapReduce jobs • Joins, grouping, filtering, sorting, statistical functions • Partitioning of data • User-defined functions • Sampling • Declarative syntax © 2010 Journey Dynamics Ltd 23
    • 25. Hive Example © 2010 Journey Dynamics Ltd 24
    • 26. Getting Started • http://hadoop.apache.org • Cloudera Distribution (VM, source, rpm, deb) • Elastic MapReduce • Cloudera VM • Pseudo-distributed cluster © 2010 Journey Dynamics Ltd 25
    • 27. Learn More • http://hadoop.apache.org • Books • Mailing Lists • Commercial Support & Training, e.g. Cloudera © 2010 Journey Dynamics Ltd 26
    • 28. Related • Cassandra 0.6 has Hadoop integration - run MapReduce jobs against data in Cassandra • NoSQL DBs with MapReduce functionality include CouchDB, MongoDB, Riak and more • RDBMS with MapReduce include Aster, Greenplum, HadoopDB and more © 2010 Journey Dynamics Ltd 27
    • 29. Gavin Heavyside gavin.heavyside@journeydynamics.com www.journeydynamics.com © 2010 Journey Dynamics Ltd 28