Hadoop, HDFS and MapReduce


Presentation given at the Dutch Java user group (NLJUG) University Session


  1. Hadoop and MapReduce: The workings of the elephant
     Friso van Vollenhoven, fvanvollenhoven@xebia.com
  2. Data everywhere
     ‣ Global data volume grows exponentially
     ‣ Information retrieval is BIG business these days
     ‣ Need means of economically storing and processing large data sets
  3. Opportunity
     ‣ Commodity hardware is ultra cheap
     ‣ CPU and storage even cheaper
  4. Traditional solution
     ‣ Store data in a (relational) database
     ‣ Run batch jobs for processing
  5. Problems with existing solutions
     ‣ Databases are seek heavy; B-tree gives log(n) random accesses per update
     ‣ Seeks are wasted time, nothing of value happens during seeks
     ‣ Databases do not play well with commoditized hardware (SANs and 16 CPU machines are not in the price sweet spot of performance / $)
     ‣ Databases were not built with horizontal scaling in mind
  6. Solution: sort/merge vs. updating the B-tree
     ‣ Eliminate the seeks, only sequential reading / writing
     ‣ Work with batches for efficiency
     ‣ Parallelize work load
     ‣ Distribute processing and storage
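
The sort/merge idea on this slide can be sketched in a few lines of Java. This is only an illustration under simple assumptions (in-memory lists stand in for on-disk files, and the class name `SortMergeSketch` is made up for this example): a sorted batch of new keys is folded into an already sorted data set in one sequential pass, with no per-key index lookups.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Hadoop code: merge a sorted batch of new
// keys into an already sorted data set using one sequential pass.
class SortMergeSketch {

    // Both inputs must be sorted ascending; the result is sorted ascending.
    static List<String> merge(List<String> existing, List<String> batch) {
        List<String> merged = new ArrayList<>(existing.size() + batch.size());
        int i = 0;
        int j = 0;
        // Walk both inputs front to back, always taking the smaller head.
        while (i < existing.size() && j < batch.size()) {
            if (existing.get(i).compareTo(batch.get(j)) <= 0) {
                merged.add(existing.get(i++));
            } else {
                merged.add(batch.get(j++));
            }
        }
        // One input is exhausted; copy the remainder of the other.
        while (i < existing.size()) merged.add(existing.get(i++));
        while (j < batch.size()) merged.add(batch.get(j++));
        return merged;
    }

    public static void main(String[] args) {
        List<String> existing = Arrays.asList("apple", "cherry", "mango");
        List<String> batch = Arrays.asList("banana", "kiwi", "peach");
        System.out.println(merge(existing, batch));
        // prints [apple, banana, cherry, kiwi, mango, peach]
    }
}
```

Each input is read once from start to end, which is the access pattern that spinning disks handle well.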
  7. History
     ‣ 2000: Apache Lucene: batch index updates and sort/merge with on disk index
     ‣ 2002: Apache Nutch: distributed, scalable open source web crawler; sort/merge optimization applies
     ‣ 2004: Google publishes GFS and MapReduce papers
     ‣ 2006: Apache Hadoop: open source Java implementation of GFS and MR to solve Nutch's problem; later becomes standalone project
     ‣ 2011: We’re here learning about it!
  8. Hadoop foundations
     ‣ Commodity hardware (3K - 7K $ machines)
     ‣ Only sequential reads / writes
     ‣ Distribution of data and processing across cluster
     ‣ Built in reliability / fault tolerance / redundancy
     ‣ Disk based, does not require data or indexes to fit in RAM
     ‣ Apache licensed, Open Source Software
  9. The US government builds their fingerprint search index using Hadoop.
  10. The content for the People You May Know feature is created by a chain of many MapReduce jobs that run daily. The jobs are reportedly a combination of graph traversal, clustering and assisted machine learning.
  11. Amazon’s Frequently Bought Together and Customers Who Bought This Item Also Bought features are brought to you by MapReduce jobs. Recommendation based on large sales transaction datasets is a common use case.
  12. Top Charts generated daily based on millions of users’ listening behavior.
  13. Top searches used for auto-completion are re-generated daily by a MapReduce job using all searches for the past couple of days. Popularity for search terms can be based on counts, but also trending and correlation with other datasets (e.g. trending on social media, news, charts in case of music and movies, best seller lists, etc.)
  14. What is Hadoop
  15. Hadoop Filesystem: HDFS
  16. HDFS overview
      ‣ Distributed filesystem
      ‣ Consists of a single master node and multiple (many) data nodes
      ‣ Files are split up into blocks (typically 64MB)
      ‣ Blocks are spread across data nodes in the cluster
      ‣ Each block is replicated multiple times to different data nodes in the cluster (typically 3 times)
      ‣ Master node keeps track of which blocks belong to a file
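
To make the block and replication numbers concrete, here is a small arithmetic sketch. The 64MB block size and replication factor 3 are the typical values from the slide; the class and method names are invented for this example and are not part of any Hadoop API.

```java
// Hypothetical helper, not Hadoop code: block and storage arithmetic for
// a file stored with a fixed block size and replication factor.
class BlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks needed for a file; the last block may be partly filled.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Raw bytes stored across the cluster: every byte is kept once per replica.
    static long rawBytes(long fileSizeBytes, int replication) {
        return fileSizeBytes * replication;
    }

    public static void main(String[] args) {
        long fileSize = 1024 * MB;  // a 1GB file
        long blockSize = 64 * MB;   // typical block size from the slide
        int replication = 3;        // typical replication from the slide

        long blocks = blockCount(fileSize, blockSize);
        System.out.println("blocks: " + blocks);                       // blocks: 16
        System.out.println("block replicas: " + blocks * replication); // block replicas: 48
        System.out.println("raw MB: " + rawBytes(fileSize, replication) / MB); // raw MB: 3072
    }
}
```

So a 1GB file becomes 16 blocks, 48 block replicas spread over the data nodes, and roughly 3GB of raw disk usage.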
  17. HDFS interaction
      ‣ Accessible through Java API
      ‣ FUSE (filesystem in user space) driver available to mount as regular FS
      ‣ C API available
      ‣ Basic command line tools in Hadoop distribution
      ‣ Web interface
  18. HDFS interaction
      ‣ File creation, directory listing and other meta data actions go through the master node (e.g. ls, du, fsck, create file)
      ‣ Data goes directly to and from data nodes (read, write, append)
      ‣ Local read path optimization: clients located on same machine as data node will always access local replica when possible
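  19. Hadoop FileSystem (HDFS)
      [Diagram: an HDFS client creates files through the Name Node (which holds paths such as /some/file and /foo/bar) and writes and reads data directly to and from the Data Nodes. Each Data Node stores blocks on its local disks and replicates them to other Data Nodes. A node local HDFS client reads data from the Data Node on its own machine.]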
  20. HDFS daemons: NameNode
      ‣ Filesystem master node
      ‣ Keeps track of directories, files and block locations
      ‣ Assigns blocks to data nodes
      ‣ Keeps track of live nodes (through heartbeats)
      ‣ Initiates re-replication in case of data node loss
      ‣ Block meta data is held in memory
        • Will run out of memory when too many files exist
      ‣ Is a SINGLE POINT OF FAILURE in the system
        • Some solutions exist
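
Heartbeat based liveness tracking, as the NameNode does for data nodes, can be sketched roughly as follows. This is not the NameNode's implementation: the class name, the node ids and the timeout value are all assumptions chosen for the illustration, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of heartbeat bookkeeping on a master node.
class HeartbeatMonitor {
    private final Map<String, Long> lastSeenMillis = new HashMap<>();
    private final long timeoutMillis;

    HeartbeatMonitor(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a worker node reports in.
    void heartbeat(String nodeId, long nowMillis) {
        lastSeenMillis.put(nodeId, nowMillis);
    }

    // Nodes whose last heartbeat is older than the timeout, sorted by id.
    // A real master would schedule re-replication of their blocks.
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeenMillis.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        Collections.sort(dead);
        return dead;
    }

    public static void main(String[] args) {
        HeartbeatMonitor monitor = new HeartbeatMonitor(1000);
        monitor.heartbeat("datanode-1", 0);
        monitor.heartbeat("datanode-2", 900);
        System.out.println(monitor.deadNodes(1500)); // prints [datanode-1]
    }
}
```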
  21. HDFS daemons: DataNode
      ‣ Filesystem worker node / “Block server”
      ‣ Uses underlying regular FS for storage (e.g. ext3)
        • Takes care of distribution of blocks across disks
        • Don’t use RAID
        • More disks means more IO throughput
      ‣ Sends heartbeats to NameNode
      ‣ Reports blocks to NameNode (on startup)
      ‣ Does not know about the rest of the cluster (shared nothing)
  22. Things to know about HDFS
      ‣ HDFS is write once, read many
        • But has append support in newer versions
      ‣ Has built in compression at the block level
      ‣ Does end-to-end checksumming on all data
      ‣ Has tools for parallelized copying of large amounts of data to other HDFS clusters (distcp)
      ‣ Provides a convenient file format to gather lots of small files into a single large one
        • Remember the NameNode running out of memory with too many files?
      ‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations
        • Optimized for sequential reads, not random access
  23. Hadoop Sequence Files
      ‣ Special type of file to store Key-Value pairs
      ‣ Stores keys and values as byte arrays
      ‣ Uses length encoded bytes as format
      ‣ Often used as input or output format for MapReduce jobs
      ‣ Has built in compression on values
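
The idea of storing key-value pairs as length encoded byte arrays can be illustrated with plain `java.io` streams. This is a simplified sketch and deliberately not the actual SequenceFile on-disk format (which also has a header, sync markers and optional compression); the class name and the 4-byte length prefix are assumptions made for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: records laid out as
// [key length][key bytes][value length][value bytes], repeated.
class LengthPrefixedRecords {

    static byte[] encode(String[][] pairs) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            for (String[] pair : pairs) {
                byte[] key = pair[0].getBytes(StandardCharsets.UTF_8);
                byte[] value = pair[1].getBytes(StandardCharsets.UTF_8);
                out.writeInt(key.length);   // 4-byte length prefix
                out.write(key);
                out.writeInt(value.length);
                out.write(value);
            }
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static List<String[]> decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            List<String[]> pairs = new ArrayList<>();
            // Read sequentially until the buffer is exhausted.
            while (in.available() > 0) {
                byte[] key = new byte[in.readInt()];
                in.readFully(key);
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                pairs.add(new String[] {
                    new String(key, StandardCharsets.UTF_8),
                    new String(value, StandardCharsets.UTF_8)
                });
            }
            return pairs;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = encode(new String[][] {{"user", "friso"}, {"lang", "java"}});
        for (String[] pair : decode(data)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Because every field carries its own length, a reader can stream through the file front to back without any delimiter scanning, which fits the sequential read pattern HDFS is optimized for.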
  24. Example: command line directory listing
      friso@fvv:~/java$ hadoop fs -ls /
      Found 3 items
      drwxr-xr-x   - friso supergroup          0 2011-03-31 17:06 /Users
      drwxr-xr-x   - friso supergroup          0 2011-03-16 14:16 /hbase
      drwxr-xr-x   - friso supergroup          0 2011-04-18 11:33 /user
      friso@fvv:~/java$
  25. Example: NameNode web interface
  26. Example: copy local file to HDFS
      friso@fvv:~/Downloads$ hadoop fs -put ./some-tweets.json tweets-data.json
  27. MapReduce: Massively parallelizable computing
  28. MapReduce, the algorithm
      [Figure: input data and required output]
  29. Map: extract something useful from each record
      [Figure: map tasks turn input records into KEYS and VALUES]

      void map(recordNumber, record) {
          key = record.findColorfulShape();
          value = record.findGrayShapes();
          emit(key, value);
      }
  30. Framework sorts all KeyValue pairs by Key
      [Figure: KEYS and VALUES before and after sorting]
  31. Reduce: process values for each key
      [Figure: reduce tasks turn sorted KEYS and VALUES into output KEYS and VALUES]

      void reduce(key, values) {
          allGrayShapes = [];
          foreach (value in values) {
              allGrayShapes.push(value);
          }
          emit(key, allGrayShapes);
      }
  32. MapReduce, the algorithm
      [Figure: input KEYS and VALUES pass through map tasks, are sorted by key, and pass through reduce tasks]
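
The three steps (map, sort by key, reduce) can be simulated on a single machine in plain Java. This is a sketch of the algorithm only, not of Hadoop's API: the class name and the "key<TAB>value" record layout are assumptions made for the example, and a `TreeMap` stands in for the framework's sort.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine simulation of map, sort and reduce.
class MiniMapReduce {

    // Map: turn one record into a (key, value) pair.
    // The "key<TAB>value" record layout is an assumption for this sketch.
    static String[] map(String record) {
        return record.split("\t", 2);
    }

    // Reduce: collect all values seen for one key into a single output line.
    static String reduce(String key, List<String> values) {
        return key + "=" + values;
    }

    static List<String> run(List<String> records) {
        // Map phase plus "sort by key": the TreeMap keeps keys in sorted
        // order and groups the values that share a key.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String record : records) {
            String[] pair = map(record);
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        // Reduce phase: one call per unique key, in key order.
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            output.add(reduce(entry.getKey(), entry.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("star\tcircle", "heart\tsquare", "star\ttriangle");
        System.out.println(run(records));
        // prints [heart=[square], star=[circle, triangle]]
    }
}
```

Because map looks at one record at a time and reduce at one key at a time, both phases can be spread over many machines; only the sort in the middle needs to bring matching keys together.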
  33. Hadoop MapReduce: parallelized on top of HDFS
      ‣ Job input comes from files on HDFS
        • Typically sequence files
        • Other formats are possible; requires specialized InputFormat implementation
        • Built in support for text files (convenient for logs, csv, etc.)
        • Files must be splittable for parallelization to work
          - Not all compression formats have this property (e.g. gzip files are not splittable)
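
Why splittability matters can be shown with a simplified split calculation. This sketch is not Hadoop's actual split logic; the class name and the rule "one split per fixed-size chunk" are assumptions for the illustration. A splittable file yields many splits that mappers can process in parallel, while a non-splittable file (for example one compressed with gzip) yields a single split for a single mapper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: divide a file into (offset, length) input splits.
class SplitSketch {

    // Each long[] is {offset, length} in bytes.
    static List<long[]> splits(long fileLength, long splitSize, boolean splittable) {
        List<long[]> result = new ArrayList<>();
        if (!splittable) {
            // The whole file must be read front to back by one task.
            result.add(new long[] {0, fileLength});
            return result;
        }
        for (long offset = 0; offset < fileLength; offset += splitSize) {
            long length = Math.min(splitSize, fileLength - offset);
            result.add(new long[] {offset, length});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(splits(200 * mb, 64 * mb, true).size());  // prints 4
        System.out.println(splits(200 * mb, 64 * mb, false).size()); // prints 1
    }
}
```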
  34. MapReduce daemons: JobTracker
      ‣ MapReduce master node
      ‣ Takes care of scheduling and job submission
      ‣ Splits jobs into tasks (Mappers and Reducers)
      ‣ Assigns tasks to worker nodes
      ‣ Reassigns tasks in case of failure
      ‣ Keeps track of job progress
      ‣ Keeps track of worker nodes through heartbeats
  35. MapReduce daemons: TaskTracker
      ‣ MapReduce worker process
      ‣ Starts Mappers and Reducers assigned by JobTracker
      ‣ Sends heartbeats to the JobTracker
      ‣ Sends task progress to the JobTracker
      ‣ Does not know about the rest of the cluster (shared nothing)
  36. Hadoop MapReduce: parallelized on top of HDFS
  37. Hadoop MapReduce: Mapper side
      ‣ Each mapper processes a piece of the total input
        • Typically blocks that reside on the same machine as the mapper (local datanode)
      ‣ Mappers sort output by key and store it on the local disk
        • If the mapper output does not fit in RAM, on disk merge sort happens
  38. Hadoop MapReduce: Reducer side
      ‣ Reducers collect sorted input KeyValue pairs over the network from Mappers
        • Reducer performs (on disk) merge on inputs from different mappers
      ‣ Reducer calls the reduce method for each unique key
        • List of values for each key is read from local disk (the result of the merge)
        • Values do not need to fit in RAM
          - Reduce methods that need a global view need enough RAM to fit all values for a key
      ‣ Reducer writes output KeyValue pairs to HDFS
        • Typically blocks go to local data node
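
The merge a reducer performs on sorted inputs from different mappers is a k-way merge. The sketch below shows the technique with in-memory lists standing in for the sorted on-disk runs; the class names are invented for this example and this is not Hadoop's implementation. Only the current head of each run has to be held in memory, which is why the full input does not need to fit in RAM.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: k-way merge of sorted runs using a min-heap.
class KWayMerge {

    // Cursor over one sorted run; holds only the current head element.
    static final class Cursor {
        final Iterator<String> rest;
        String head;

        Cursor(Iterator<String> run) {
            this.rest = run;
            this.head = run.next();
        }

        // Move to the next element; returns false when the run is exhausted.
        boolean advance() {
            if (!rest.hasNext()) return false;
            head = rest.next();
            return true;
        }
    }

    // Every run must be sorted ascending; the result is sorted ascending.
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> run : sortedRuns) {
            if (!run.isEmpty()) heap.add(new Cursor(run.iterator()));
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor smallest = heap.poll();  // run with the smallest head
            merged.add(smallest.head);
            if (smallest.advance()) heap.add(smallest);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
            Arrays.asList("a", "d"),
            Arrays.asList("b"),
            Arrays.asList("c", "e"));
        System.out.println(merge(runs)); // prints [a, b, c, d, e]
    }
}
```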
  39. Hadoop MapReduce: parallelized on top of HDFS
  40. <PLUG> Summer Classes: Big data crunching using Hadoop and other NoSQL tools
        • Write Hadoop MapReduce jobs in Java
        • Run on an actual cluster pre-loaded with several datasets
        • Create a simple application or visualization with the result
        • Learn about Hadoop without the hassle of building a production cluster first
        • Have lots of fun!
      Dates: July 12, August 10
      Only € 295,= for a full day course
      http://www.xebia.com/summerclasses/bigdata </PLUG>