Hadoop and MapReduce: The workings of the elephant
Friso van Vollenhoven email@example.com
Data everywhere
‣ Global data volume grows exponentially
‣ Information retrieval is BIG business these days
‣ Need means of economically storing and processing large data sets
Opportunity
‣ Commodity hardware is ultra cheap
‣ CPU and storage even cheaper
Traditional solution
‣ Store data in a (relational) database
‣ Run batch jobs for processing
Problems with existing solutions
‣ Databases are seek heavy; a B-tree needs log(n) random accesses per update
‣ Seeks are wasted time; nothing of value happens during a seek
‣ Databases do not play well with commodity hardware (SANs and 16 CPU machines are not in the price sweet spot of performance / $)
‣ Databases were not built with horizontal scaling in mind
Solution: sort/merge vs. updating the B-tree
‣ Eliminate the seeks, only sequential reading / writing
‣ Work with batches for efficiency
‣ Parallelize work load
‣ Distribute processing and storage
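The sort/merge idea can be illustrated with a small in-memory sketch. It is only an illustration under stated assumptions: the function name `apply_batch` is hypothetical, and Python lists stand in for files that would be read and written sequentially on disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def apply_batch(sorted_store, updates):
    """Merge a batch of (key, value) updates into a key-sorted store.

    Both inputs are read front to back exactly once, which is the
    sequential access pattern that sort/merge relies on. For duplicate
    keys the update wins over the stored record.
    """
    batch = sorted(updates, key=itemgetter(0))  # sort the batch once (stable)
    # Tag each record with its origin so that updates sort after stored
    # records with the same key; heapq.merge streams both inputs in order.
    old = ((key, 0, value) for key, value in sorted_store)
    new = ((key, 1, value) for key, value in batch)
    merged = heapq.merge(old, new, key=lambda r: (r[0], r[1]))
    result = []
    for key, group in groupby(merged, key=itemgetter(0)):
        *_, last = group              # last record per key is the newest one
        result.append((key, last[2]))
    return result


store = [("apple", 1), ("cherry", 3), ("melon", 7)]
updates = [("melon", 8), ("banana", 2), ("apple", 5)]
new_store = apply_batch(store, updates)
print(new_store)
```

Each update costs one position in a sorted batch instead of log(n) random accesses into an index; the price is that updates only become visible after the batch has been merged.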
History
‣ 2000: Apache Lucene: batch index updates and sort/merge with on disk index
‣ 2002: Apache Nutch: distributed, scalable open source web crawler; sort/merge optimization applies
‣ 2003-2004: Google publishes the GFS (2003) and MapReduce (2004) papers
‣ 2006: Apache Hadoop: open source Java implementation of GFS and MR to solve Nutch's problem; later becomes standalone project
‣ 2011: We're here learning about it!
Hadoop foundations
‣ Commodity hardware ($3K - $7K machines)
‣ Only sequential reads / writes
‣ Distribution of data and processing across cluster
‣ Built in reliability / fault tolerance / redundancy
‣ Disk based, does not require data or indexes to fit in RAM
‣ Apache licensed, Open Source Software
The US government builds its fingerprint search index using Hadoop.
The content of the People You May Know feature is created by a chain of many MapReduce jobs that run daily. The jobs are reportedly a combination of graph traversal, clustering and assisted machine learning.
Amazon’s Frequently Bought Together and Customers Who Bought This Item Also Bought features are brought to you by MapReduce jobs. Recommendation based on large sales transaction datasets is a common use case.
Top Charts generated daily based on millions of users’ listening behavior.
Top searches used for auto-completion are re-generated daily by a MapReduce job using all searches for the past couple of days. Popularity for search terms can be based on counts, but also on trending and correlation with other datasets (e.g. trending on social media, news, charts in case of music and movies, best seller lists, etc.)
Hadoop Filesystem: HDFS
Friso van Vollenhoven firstname.lastname@example.org
HDFS overview
‣ Distributed filesystem
‣ Consists of a single master node and multiple (many) data nodes
‣ Files are split up into blocks (typically 64MB)
‣ Blocks are spread across data nodes in the cluster
‣ Each block is replicated multiple times to different data nodes in the cluster (typically 3 times)
‣ Master node keeps track of which blocks belong to a file
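As a rough illustration of the overview above, the sketch below splits a file into 64 MB blocks and assigns three replicas per block. The function names are hypothetical, and the round-robin placement is an assumption made for this sketch only; it is not the placement policy HDFS actually uses.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the typical block size named above
REPLICATION = 3                 # typical replication factor named above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [(offset, min(block_size, file_size - offset))
            for offset in range(0, file_size, block_size)]


def place_replicas(num_blocks, data_nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is a stand-in chosen for this sketch; it is not the
    real HDFS placement policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [data_nodes[(block + i) % len(data_nodes)]
                            for i in range(replication)]
    return placement


blocks = split_into_blocks(200 * 1024 * 1024)          # a 200 MB file
block_map = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), block_map[0])
```

A 200 MB file yields three full blocks and one 8 MB remainder; the mapping from blocks to nodes is the kind of information the master node has to track.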
HDFS interaction
‣ Accessible through Java API
‣ FUSE (filesystem in user space) driver available to mount as regular FS
‣ C API available
‣ Basic command line tools in Hadoop distribution
‣ Web interface
HDFS interaction
‣ File creation, directory listing and other meta data actions go through the master node (e.g. ls, du, fsck, create file)
‣ Data goes directly to and from data nodes (read, write, append)
‣ Local read path optimization: clients located on same machine as data node will always access local replica when possible
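The split between metadata and data traffic can be sketched with a toy model. Everything here is hypothetical (the `NameNode` class, `read_file`, the block ids and host names); dictionaries stand in for network calls, and the real client protocol is considerably more involved.

```python
class NameNode:
    """Toy master: knows which blocks make up a file and where replicas live."""

    def __init__(self):
        self.files = {}       # path -> list of block ids
        self.locations = {}   # block id -> list of data node host names

    def get_block_locations(self, path):
        return [(block, self.locations[block]) for block in self.files[path]]


def read_file(name_node, data_nodes, path, client_host):
    """Read a file: metadata from the master, bytes straight from data nodes."""
    contents, hosts_used = b"", []
    for block, hosts in name_node.get_block_locations(path):
        # Prefer a replica on the client's own machine when one exists.
        host = client_host if client_host in hosts else hosts[0]
        contents += data_nodes[host][block]
        hosts_used.append(host)
    return contents, hosts_used


nn = NameNode()
nn.files["/some/file"] = ["blk_1", "blk_2"]
nn.locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2", "dn3"]}
dns = {
    "dn1": {"blk_1": b"hello "},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_2": b"world"},
}
data, used = read_file(nn, dns, "/some/file", client_host="dn2")
print(data, used)
```

A client running on `dn2` reads both blocks locally; a client on a machine without replicas falls back to the first listed host for each block.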
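[Diagram: Hadoop FileSystem (HDFS). An HDFS client sends "create file" requests to the Name Node, which holds paths such as /some/file and /foo/bar, and reads and writes data directly on the Data Nodes. Each Data Node stores blocks on its local disks and replicates them to other Data Nodes. A node-local HDFS client reads data from the Data Node on its own machine.]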
HDFS daemons: NameNode
‣ Filesystem master node
‣ Keeps track of directories, files and block locations
‣ Assigns blocks to data nodes
‣ Keeps track of live nodes (through heartbeats)
‣ Initiates re-replication in case of data node loss
‣ Block meta data is held in memory
  • Will run out of memory when too many files exist
‣ Is a SINGLE POINT OF FAILURE in the system
  • Some solutions exist
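The heartbeat and re-replication bookkeeping listed above might look roughly like the following toy model. The class name `BlockTracker`, its methods, and the 30 second timeout are all assumptions made for illustration; they do not mirror the actual NameNode code or its defaults.

```python
class BlockTracker:
    """Toy model of the master's liveness and replication bookkeeping."""

    def __init__(self, replication=3, timeout=30.0):
        self.replication = replication
        self.timeout = timeout    # seconds without a heartbeat => dead (assumed value)
        self.last_seen = {}       # node -> timestamp of its last heartbeat
        self.block_nodes = {}     # block id -> set of nodes holding a replica

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def block_report(self, node, blocks):
        for block in blocks:
            self.block_nodes.setdefault(block, set()).add(node)

    def live_nodes(self, now):
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}

    def under_replicated(self, now):
        """Blocks with fewer live replicas than the target, with the shortfall."""
        live = self.live_nodes(now)
        result = {}
        for block, nodes in self.block_nodes.items():
            missing = self.replication - len(nodes & live)
            if missing > 0:
                result[block] = missing
        return result


tracker = BlockTracker()
for node in ("dn1", "dn2", "dn3"):
    tracker.heartbeat(node, now=0.0)
    tracker.block_report(node, ["blk_1"])
before = tracker.under_replicated(now=10.0)   # all three nodes are alive

tracker.heartbeat("dn1", now=40.0)
tracker.heartbeat("dn2", now=40.0)            # dn3 has gone silent
after = tracker.under_replicated(now=50.0)
print(before, after)
```

Once `dn3` misses its heartbeats, `blk_1` is reported as one replica short, which is the trigger for copying it to another live node. Note that all of this state lives in the master's memory, which is why the number of files and blocks is bounded by its RAM.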
HDFS daemons: DataNode
‣ Filesystem worker node / “Block server”
‣ Uses underlying regular FS for storage (e.g. ext3)
  • Takes care of distribution of blocks across disks
  • Don’t use RAID
  • More disks means more IO throughput
‣ Sends heartbeats to NameNode
‣ Reports blocks to NameNode (on startup)
‣ Does not know about the rest of the cluster (shared nothing)
Things to know about HDFS
‣ HDFS is write once, read many
  • But has append support in newer versions
‣ Supports compression at the block level (through compressed file formats such as sequence files)
‣ Does end-to-end checksumming on all data
‣ Has tools for parallelized copying of large amounts of data to other HDFS clusters (distcp)
‣ Provides a convenient file format to gather lots of small files into a single large one
  • Remember the NameNode running out of memory with too many files?
‣ HDFS is best used for large, unstructured volumes of raw data in BIG files used for batch operations
  • Optimized for sequential reads, not random access
Hadoop Sequence Files
‣ Special type of file to store Key-Value pairs
‣ Stores keys and values as byte arrays
‣ Uses length encoded bytes as format
‣ Often used as input or output format for MapReduce jobs
‣ Has built in compression on values
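To make "length encoded bytes" concrete, here is a minimal sketch of a length-prefixed key/value record stream. It is not the actual SequenceFile on-disk layout (which also carries a header and other metadata); the 4-byte big-endian length fields and the function names are assumptions chosen for this example.

```python
import io
import struct


def write_record(stream, key, value):
    """Append one record: 4-byte key length, 4-byte value length, then bytes."""
    stream.write(struct.pack(">II", len(key), len(value)))
    stream.write(key)
    stream.write(value)


def read_records(stream):
    """Yield (key, value) byte pairs until the stream is exhausted."""
    while True:
        header = stream.read(8)
        if not header:
            return
        if len(header) < 8:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(">II", header)
        key = stream.read(key_len)
        value = stream.read(value_len)
        if len(key) < key_len or len(value) < value_len:
            raise ValueError("truncated record body")
        yield key, value


buffer = io.BytesIO()
write_record(buffer, b"user-1", b"clicked")
write_record(buffer, b"user-2", b"")
buffer.seek(0)
records = list(read_records(buffer))
print(records)
```

Because every record announces its own length, a reader can walk the file strictly sequentially without any delimiter scanning, and keys or values may contain arbitrary bytes.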
Hadoop MapReduce: parallelized on top of HDFS
‣ Job input comes from files on HDFS
  • Typically sequence files
  • Other formats are possible; requires specialized InputFormat implementation
  • Built in support for text files (convenient for logs, csv, etc.)
  • Files must be splittable for parallelization to work
    - Not all compression formats have this property (e.g. gzip)
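The splittability requirement can be demonstrated with Python's standard `gzip` module. This is a small experiment rather than Hadoop code: it shows that a reader can pick up plain text at an arbitrary offset, while the second half of a gzip stream cannot be decoded on its own.

```python
import gzip
import zlib

lines = "".join(f"record {i}\n" for i in range(10_000)).encode("ascii")
midpoint = len(lines) // 2

# Plain text: a reader can start at any offset, skip the (possibly
# partial) first line, and continue from the next full record.
tail = lines[midpoint:]
first_full_line = tail.split(b"\n", 2)[1]

# gzip: the same trick fails, because the stream can only be decoded
# starting from its beginning.
compressed = gzip.compress(lines)
try:
    gzip.decompress(compressed[len(compressed) // 2:])
    gzip_splittable = True
except (OSError, EOFError, zlib.error):
    gzip_splittable = False

print(first_full_line, gzip_splittable)
```

A gzip-compressed input file therefore has to be processed by a single mapper from start to finish, no matter how many blocks it spans.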
MapReduce daemons: JobTracker
‣ MapReduce master node
‣ Takes care of scheduling and job submission
‣ Splits jobs into tasks (Mappers and Reducers)
‣ Assigns tasks to worker nodes
‣ Reassigns tasks in case of failure
‣ Keeps track of job progress
‣ Keeps track of worker nodes through heartbeats
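The assign / reassign / progress responsibilities above can be sketched as a tiny scheduler. The class `JobTrackerSketch` and its methods are hypothetical, and the model is deliberately simplified: it ignores data locality and the fact that reducers depend on mapper output.

```python
from collections import deque


class JobTrackerSketch:
    """Toy scheduler: hands out tasks and re-queues those of lost workers."""

    def __init__(self, tasks):
        self.pending = deque(tasks)
        self.running = {}     # task -> worker currently executing it
        self.done = set()

    def assign(self, worker):
        """Give the next pending task to a worker, or None if none is left."""
        if not self.pending:
            return None
        task = self.pending.popleft()
        self.running[task] = worker
        return task

    def report_success(self, task):
        self.running.pop(task, None)
        self.done.add(task)

    def worker_lost(self, worker):
        """Put every task that was running on a lost worker back in the queue."""
        lost = [t for t, w in self.running.items() if w == worker]
        for task in lost:
            del self.running[task]
            self.pending.append(task)
        return lost

    def progress(self):
        total = len(self.pending) + len(self.running) + len(self.done)
        return len(self.done) / total if total else 1.0


jt = JobTrackerSketch(["map-0", "map-1", "reduce-0"])
first = jt.assign("tt1")
second = jt.assign("tt2")
jt.report_success(second)
requeued = jt.worker_lost("tt1")    # tt1 stopped sending heartbeats
print(requeued, list(jt.pending), jt.progress())
```

Because tasks are independent units of work, losing a worker only costs the tasks it was running; they simply go back in the queue for another worker.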
MapReduce daemons: TaskTracker
‣ MapReduce worker process
‣ Starts Mappers and Reducers assigned by JobTracker
‣ Sends heartbeats to the JobTracker
‣ Sends task progress to the JobTracker
‣ Does not know about the rest of the cluster (shared nothing)
Hadoop MapReduce: Mapper side
‣ Each mapper processes a piece of the total input
  • Typically blocks that reside on the same machine as the mapper (local datanode)
‣ Mappers sort output by key and store it on the local disk
  • If the mapper output does not fit in RAM, on disk merge sort happens
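The mapper-side behaviour described above (map, sort by key, spill and merge when the buffer is full) can be sketched as follows. The word-count map function, the `run_mapper` name and the tiny `buffer_limit` are assumptions for illustration; lists stand in for spill files on the local disk.

```python
import heapq


def word_count_map(line):
    """Example map function (an assumption for this sketch): emit (word, 1)."""
    for word in line.split():
        yield word.lower(), 1


def run_mapper(lines, map_fn, buffer_limit=4):
    """Apply map_fn to each input line and return output sorted by key.

    Output is buffered up to buffer_limit pairs; a full buffer is sorted
    and set aside as a "spill". The final step merges all sorted spills,
    as an on-disk merge sort would. Lists stand in for local files.
    """
    spills, buffer = [], []
    for line in lines:
        for pair in map_fn(line):
            buffer.append(pair)
            if len(buffer) >= buffer_limit:
                spills.append(sorted(buffer))
                buffer = []
    if buffer:
        spills.append(sorted(buffer))
    return list(heapq.merge(*spills))


split = ["the quick brown fox", "jumps over the lazy dog", "the end"]
mapper_output = run_mapper(split, word_count_map)
print(mapper_output)
```

The mapper never needs more than `buffer_limit` pairs in memory at once; the merge step only reads each sorted spill sequentially.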
Hadoop MapReduce: Reducer side
‣ Reducers collect sorted input KeyValue pairs over the network from Mappers
  • Reducer performs (on disk) merge on inputs from different mappers
‣ Reducer calls the reduce method for each unique key
  • List of values for each key is read from local disk (the result of the merge)
  • Values do not need to fit in RAM
    - Reduce methods that need a global view need enough RAM to fit all values for a key
‣ Reducer writes output KeyValue pairs to HDFS
  • Typically blocks go to local data node
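The reducer side can be sketched in the same style: merge the key-sorted outputs of several mappers, then call a reduce function once per unique key. The names `run_reducer`, `sum_reduce` and `median_reduce` are hypothetical, and in-memory lists stand in for data fetched over the network and merged on disk.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def run_reducer(sorted_mapper_outputs, reduce_fn):
    """Merge key-sorted inputs from several mappers and reduce per key.

    heapq.merge streams the inputs in key order, and groupby hands
    reduce_fn a lazy iterator of values, so the values for a key never
    have to be held in memory all at once.
    """
    merged = heapq.merge(*sorted_mapper_outputs, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        yield key, reduce_fn(key, (value for _, value in group))


def sum_reduce(key, values):
    """Streaming reduce: needs only a running total."""
    return sum(values)


def median_reduce(key, values):
    """Global-view reduce: must materialize every value for the key."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


from_mapper_a = [("apple", 1), ("apple", 1), ("pear", 1)]
from_mapper_b = [("apple", 1), ("banana", 1), ("pear", 1)]
counts = list(run_reducer([from_mapper_a, from_mapper_b], sum_reduce))
print(counts)
```

`sum_reduce` consumes its values one at a time, whereas `median_reduce` is an example of a reduce method that needs a global view and therefore enough RAM for all values of a key.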
<PLUG>
Summer Classes: Big data crunching using Hadoop and other NoSQL tools
  • Write Hadoop MapReduce jobs in Java
  • Run on an actual cluster pre-loaded with several datasets
  • Create a simple application or visualization with the result
  • Learn about Hadoop without the hassle of building a production cluster first
  • Have lots of fun!
Dates: July 12, August 10
Only € 295,= for a full day course
http://www.xebia.com/summerclasses/bigdata
</PLUG>