Apache Hadoop: POSH Meetup, Palo Alto, CA, April 2014
A presentation on Apache Hadoop in Palo Alto, CA for POSH (http://www.meetup.com/Pivotal-Open-Source-Hub/)

Presentation Transcript

  • 1© Copyright 2014 Pivotal. All rights reserved. Intro to Hadoop: Hype or Reality – you decide kcrocker@gopivotal.com Pivotal Meet-up Kevin Crocker, Consulting Instructor, Pivotal Academy March 19, 2014
  • 2© Copyright 2014 Pivotal. All rights reserved. Why is this Meet-up necessary? •  What is the future of enterprise data architecture? –  The explosion of data –  Volume, Variety, Velocity –  Overruns traditional data stores –  What is the business value of collecting all this data?
  • 3© Copyright 2014 Pivotal. All rights reserved. Volume •  At a recent data conference, one participant told the audience that they collected 7 PB of data a day – and generated another 7 PB of data analytics •  That’s 63 racks! A day! X 2 •  What do we even call that amount of data? –  Data Warehouse(s), Data Store(s) –  New Term: Data Lake
  • 4© Copyright 2014 Pivotal. All rights reserved. Variety •  At the same data conference, another presenter participated in a study using wearable medical technology to monitor health –  Collected 1 million readings a day = 12 readings a second –  when was the last time you had YOUR blood pressure checked? •  Toronto – so many sensors they can track millions of cell phones over 400 square miles – 24x7
  • 5© Copyright 2014 Pivotal. All rights reserved. Velocity •  Ingesting this amount of data is difficult •  Analyzing this amount of data in traditional ways is also difficult –  A client recently told me that it used to take 3 weeks for them to analyze the data from their sensors, now they do it in 3 hours
  • 6© Copyright 2014 Pivotal. All rights reserved. Business Value •  Wall Street Journal – those businesses in Toronto pay to get summary reports of all that data and then gear their marketing campaigns to drive new revenue
  • 7© Copyright 2014 Pivotal. All rights reserved. The Data Lake Dream, Forbes, 01/14/2014 •  In an article published in Forbes, the author mentions the term Data Lake and the technology that addresses the problem of big data => Hadoop •  Four levels of Hadoop Maturity –  Life Before Hadoop -> Hadoop is Introduced -> Growing the Data Lake -> Data Lake and Application Cloud
  • 8© Copyright 2014 Pivotal. All rights reserved. So – Let’s talk about Hadoop •  Hadoop Overview –  Core Elements: HDFS and MapReduce –  Ecosystem •  HDFS Architecture •  Hadoop MapReduce •  Hadoop Ecosystem •  MapReduce Primer •  Buckle up!
  • 9© Copyright 2014 Pivotal. All rights reserved. Hadoop Overview
  • 10© Copyright 2014 Pivotal. All rights reserved. Hadoop Core •  Based on two Google papers from 2003/4 – Google File System and MapReduce •  Spun out of the Nutch open-source web-search project because of the need to store its data •  Open-source Apache project out of Yahoo! in January 2006 •  Distributed fault-tolerant data storage (distribution and replication of resources) and distributed batch processing (not for random reads/writes, or updates) •  Provides linear scalability on commodity hardware •  Adopted by many: –  Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more http://wiki.apache.org/hadoop/PoweredBy •  Hadoop uses data redundancy rather than backup strategies
  • 11© Copyright 2014 Pivotal. All rights reserved. Hadoop Overview •  Consists of: –  Key sub-projects •  Hadoop Common: Common utilities/tools for all Hadoop components/sub-projects •  HDFS: A reliable, high-bandwidth, distributed file system •  Map/Reduce: A programming framework to process large datasets •  YARN –  Other key Apache projects in the Hadoop ecosystem •  Avro: A data serialization system •  HBase/Cassandra: Scalable, distributed NoSQL databases that support structured data storage for large tables •  Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying •  Pig: A high-level data-flow language and execution framework for parallel computation •  ZooKeeper: A high-performance coordination service for distributed applications •  Latest versions of Hadoop: –  Stable and widely used versions – V1 => 1.2.1, V2 => 2.2.0
  • 12© Copyright 2014 Pivotal. All rights reserved. Why? •  Bottom line: –  Flexible –  Scalable –  Inexpensive
  • 13© Copyright 2014 Pivotal. All rights reserved. Overview •  Great at –  Reliable storage for multi-petabyte data sets –  Batch queries and analytics –  Complex hierarchical data structures with changing schemas, unstructured and structured data •  Not so great at –  Changes to files (can’t do it…) – not OLTP –  Low-latency responses –  Analyst usability •  This is less of a concern now due to higher-level languages
  • 14© Copyright 2014 Pivotal. All rights reserved. Data Structure •  Bytes! And more Bytes! (Peta) •  No more ETL necessary??? •  Store data now, process later •  Structure (schema) on read –  Built-in support for common data types and formats –  Extendable –  Flexible
  • 15© Copyright 2014 Pivotal. All rights reserved. Versioning •  Versions 0.20.x, 0.21.x, 0.22.x, 0.23.x, 1.x.x –  Two main MR packages: •  org.apache.hadoop.mapred (deprecated) •  org.apache.hadoop.mapreduce (new hotness) •  Version 2.2.0, GA Oct 2013 –  NameNode HA –  YARN – Next Gen MapReduce –  HDFS Federation, Snapshots
  • 16© Copyright 2014 Pivotal. All rights reserved. HDFS Architecture
  • 17© Copyright 2014 Pivotal. All rights reserved. HDFS Architecture
  • 18© Copyright 2014 Pivotal. All rights reserved. HDFS Architecture (Master/Worker) •  HDFS Master: “Namenode” –  Manages the filesystem namespace –  Controls read/write access to files –  Serves open/close/rename file requests from client –  Manages block replication (rack-aware block placement, auto re-replication) –  Checkpoints namespace and journals namespace changes for reliability •  HDFS Workers: “Datanodes” –  Serve read/write requests from clients –  Perform replication tasks upon instruction by Namenode –  Periodically validate the data checksum •  HDFS Client –  Interface available in Java, C, and command line. –  Client computes and validates checksum stored by Datanode for data integrity check (if block is corrupt, then other replica is accessed)
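The slide above notes that the HDFS client interface is available in Java, C, and the command line. As a concrete illustration, here is a minimal, hypothetical Java sketch using the standard org.apache.hadoop.fs.FileSystem API to write and read a file; the path and contents are made up, and the Configuration is assumed to pick up fs.defaultFS from core-site.xml on the classpath.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // connects to the NameNode named in fs.defaultFS

        Path file = new Path("/tmp/hello.txt");          // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) { // NameNode allocates blocks; DataNodes store them
          out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(file)) {     // reads are served from the closest replica
          System.out.println(in.readUTF());
        }
      }
    }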
  • Hadoop Distributed File System Data Model: •  Data is organized into files and directories •  Files are divided into uniformly-sized blocks and distributed across cluster nodes •  Blocks are replicated to handle hardware failure •  Filesystem keeps checksums of data for corruption detection and recovery •  Read requests are always served from closest replica •  Not strictly POSIX-compliant
  • 20© Copyright 2014 Pivotal. All rights reserved. Hadoop Distributed File System •  Distributed, Fault-Tolerant & Scalable (petabyte) File System •  Designed to run on commodity hardware – hardware failure is the norm (block-level replication plays the role of RAID-1) •  High throughput for streaming/sequential data access – as opposed to low latency for random I/O •  Tuned for a smaller number of large data files •  Simple coherency model (write once, read multiple times) – appending data to a file is supported since 0.19 •  Support for scalable data processing – exposes metadata such as the number of block replicas and their locations, for scheduling computations closer to the data •  Portability across heterogeneous HW & SW platforms – file system written in Java •  High Availability and Namespace federation support (2.0.x-alpha)
  • 21© Copyright 2014 Pivotal. All rights reserved. HDFS Overview •  Hierarchical UNIX-like file system for data storage –  sort of (files, folders, permissions, users, groups) … but it is a virtual file system •  Splitting of large files into blocks •  Distribution and replication of blocks to nodes •  Two key services –  Master NameNode –  Many DataNodes •  Checkpoint Node (Secondary NameNode)
  • 22© Copyright 2014 Pivotal. All rights reserved. NameNode •  Single master service for HDFS •  Single point of failure (HDFS 1.x; not 2.x) •  Stores file to block to location mappings in the namespace •  All transactions are logged to disk •  NameNode startup reads namespace image and logs
  • 23© Copyright 2014 Pivotal. All rights reserved. Checkpoint Node (Secondary NN) •  Performs checkpoints of the NameNode’s namespace and logs •  Not a hot backup! 1.  Loads up namespace 2.  Reads log transactions to modify namespace 3.  Saves namespace as a checkpoint
  • 24© Copyright 2014 Pivotal. All rights reserved. DataNode •  Stores blocks on local disk •  Sends frequent heartbeats to NameNode •  Sends block reports to NameNode (all the block IDs it has, checksums, etc) •  Clients connect to DataNode for I/O
  • 25© Copyright 2014 Pivotal. All rights reserved. How HDFS Works - Writes (diagram: Client, NameNode, DataNodes A–D, blocks A1–A4) 1. Client contacts NameNode to write data 2. NameNode says write it to these nodes 3. Client sequentially writes blocks to the DataNodes
  • 26© Copyright 2014 Pivotal. All rights reserved. How HDFS Works - Writes (diagram) DataNodes replicate data blocks, orchestrated by the NameNode
  • 27© Copyright 2014 Pivotal. All rights reserved. How HDFS Works - Reads (diagram) 1. Client contacts NameNode to read data 2. NameNode says you can find it here 3. Client sequentially reads blocks from the DataNodes
  • 28© Copyright 2014 Pivotal. All rights reserved. How HDFS Works - Failure (diagram) Client connects to another node serving that block
  • 29© Copyright 2014 Pivotal. All rights reserved. Block Replication •  Default of three replicas •  Rack-aware system –  One block on same rack –  One block on same rack, different host –  One block on another rack •  Automatic re-copy by NameNode, as needed (diagram: Rack 1 and Rack 2, each holding several DataNodes)
  • 30© Copyright 2014 Pivotal. All rights reserved. HDFS 2.0 Features •  NameNode High-Availability (HA) –  Two redundant NameNodes in active/passive configuration –  Manual or automated failover •  NameNode Federation –  Multiple independent NameNodes using the same collection of DataNodes
  • 31© Copyright 2014 Pivotal. All rights reserved. Hadoop MapReduce
  • Map-Reduce Programming Model •  Programming model for processing lists of key/value pairs •  Map function: processes input key/value pairs (k1, v1) and produces a set of intermediate key/value pairs, list(k2, v2) •  Shuffle: sorts or groups the intermediate output by k2 into (k2, list(v2)) •  Reduce function: merges all intermediate values associated with the same intermediate key and produces output key/value pairs, (k2, list(v3))
  • Parallel Execution Model for Map-Reduce •  Application writer specifies: Map and Reduce classes, input data on HDFS, and (optionally) input/output format classes •  Workflow: –  The input phase generates a number of logical FileSplits from the input files; one Map task is created per logical file split –  Each Map task loads the Map class and executes the map function to transform input k-v pairs into a new set of k-v pairs; a record reader class, supplied as part of the InputFormat, reads each input record as a k-v pair; there is one invocation of the map function per k-v pair from the associated input split –  Map output keys are stored on local disk in sorted partitions, one partition per reduce task –  Each Reduce task fetches map output (from its associated partition) as soon as a map task finishes its processing; map outputs are merged and sorted; there is one invocation of the reduce function per distinct key and its associated list of values –  Output k-v pairs are stored on HDFS, one file per reduce task –  The framework handles task scheduling and recovery (diagram: input splits → Map 0–2 → sorted partitions → shuffle → merge & sort → Reduce 0–1 → output part-0/part-1)
  • 34© Copyright 2014 Pivotal. All rights reserved. Hadoop MapReduce 1.x •  Moves the code to the data •  JobTracker –  Master service to monitor jobs •  TaskTracker –  Multiple services to run tasks in parallel –  Same physical machine as a DataNode •  A job contains many tasks (One data block equals one task ) •  A task contains one or more task attempts (success = good, failed task attempts are given to another Task Tracker for processing: 4 single failed task attempts = one failed job)
  • 35© Copyright 2014 Pivotal. All rights reserved. JobTracker •  Monitors job and task progress •  Issues task attempts to TaskTrackers •  Re-tries failed task attempts •  Four failed attempts = one failed job •  Schedules jobs in FIFO order –  Fair Scheduler •  Single point of failure for MapReduce
  • 36© Copyright 2014 Pivotal. All rights reserved. TaskTrackers •  Runs on same node as DataNode service •  Sends heartbeats and task reports to JobTracker •  Configurable number of map and reduce slots •  Runs map and reduce task attempts –  Separate JVM!
  • 37© Copyright 2014 Pivotal. All rights reserved. Exploiting Data Locality •  JobTracker will schedule a task on a TaskTracker that is local to the block –  3 options! Because 3 replicas •  If TaskTracker is busy, selects TaskTracker on same rack –  Many options! •  If still busy, chooses an available TaskTracker at random – Rare!
  • 38© Copyright 2014 Pivotal. All rights reserved. YARN (aka MapReduce 2) •  Abstract framework for distributed application development •  Split functionality of JobTracker into two components –  ResourceManager –  ApplicationMaster •  TaskTracker becomes NodeManager –  Containers instead of map and reduce slots •  Configurable amount of memory per NodeManager
  • 39© Copyright 2014 Pivotal. All rights reserved. How MapReduce Works (diagram: Client, JobTracker, and TaskTrackers A–D co-located with DataNodes A–D) 1. Client submits job to JobTracker 2. JobTracker submits tasks to TaskTrackers 3. Job output is written to DataNodes w/replication 4. JobTracker reports metrics back to client
  • 40© Copyright 2014 Pivotal. All rights reserved. How MapReduce Works - Failure (diagram) JobTracker assigns the task to a different node
  • 41© Copyright 2014 Pivotal. All rights reserved. MapReduce 2.x on YARN •  MapReduce API has not changed –  Rebuild required to upgrade from 1.x to 2.x •  MapReduce History Server to store… history
  • 42© Copyright 2014 Pivotal. All rights reserved. YARN – Architecture •  Client •  Submit Job/applications •  Resource Manager •  Schedule resources •  AppMaster •  Manage/monitor lifecycle of the M/R Job •  Node Manager •  Manage/monitor task lifecycle •  Container •  Task JVM •  No distinction between map and reduce tasks
  • 43© Copyright 2014 Pivotal. All rights reserved. YARN – Map/Reduce
  • 44© Copyright 2014 Pivotal. All rights reserved. Hadoop Ecosystem
  • 45© Copyright 2014 Pivotal. All rights reserved. Hadoop Ecosystem •  Core Technologies –  Hadoop Distributed File System –  Hadoop MapReduce •  Many other tools… –  Which I will be describing… now
  • 46© Copyright 2014 Pivotal. All rights reserved. Moving Data •  Sqoop –  Moving data between RDBMS and HDFS –  Say, migrating MySQL tables to HDFS •  Flume –  Streams event data from sources to sinks –  Say, weblogs from multiple servers into HDFS
  • 47© Copyright 2014 Pivotal. All rights reserved. Flume Architecture
  • 48© Copyright 2014 Pivotal. All rights reserved. Higher Level APIs •  Pig –  Data-flow language – aptly named PigLatin -- to generate one or more MapReduce jobs against data stored locally or in HDFS •  Hive –  Data warehousing solution, allowing users to write SQL-like queries to generate a series of MapReduce jobs against data stored in HDFS
  • 49© Copyright 2014 Pivotal. All rights reserved. Pig Word Count

    A = LOAD '$input';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
    C = GROUP B BY word;
    D = FOREACH C GENERATE group AS word, COUNT(B);
    STORE D INTO '$output';
  • 50© Copyright 2014 Pivotal. All rights reserved. Key/Value Stores •  HBase •  Accumulo •  Implementations of Google’s Big Table for HDFS •  Provides random, real-time access to big data •  Supports updates and deletes of key/value pairs
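To make the "random, real-time access" point concrete, here is a hedged sketch of a single-row put/get/delete with the HBase Java client. It uses the Connection/Table API introduced in HBase 1.0, which is newer than the HTable API that was current when this talk was given; the table name and column family are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml (ZooKeeper quorum, etc.)
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("webtable"))) {  // hypothetical table

          Put put = new Put(Bytes.toBytes("row1"));  // random, real-time write to one row
          put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("anchor"), Bytes.toBytes("example"));
          table.put(put);

          Result result = table.get(new Get(Bytes.toBytes("row1")));  // random, real-time read
          System.out.println(Bytes.toString(
              result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("anchor"))));

          table.delete(new Delete(Bytes.toBytes("row1")));  // deletes are supported, unlike raw HDFS files
        }
      }
    }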
  • 51© Copyright 2014 Pivotal. All rights reserved. HBase Architecture (diagram: Client coordinates through ZooKeeper and the Master to reach RegionServers; each RegionServer hosts Regions made up of Stores, each with a MemStore and StoreFiles, persisted on HDFS)
  • 52© Copyright 2014 Pivotal. All rights reserved. Data Structure •  Avro –  Data serialization system designed for the Hadoop ecosystem –  Expressed as JSON •  Parquet –  Compressed, efficient columnar storage for Hadoop and other systems
  • 53© Copyright 2014 Pivotal. All rights reserved. Scalable Machine Learning •  Mahout –  Library for scalable machine learning written in Java –  Very robust examples! –  Classification, Clustering, Pattern Mining, Collaborative Filtering, and much more
  • 54© Copyright 2014 Pivotal. All rights reserved. Workflow Management •  Oozie –  Scheduling system for Hadoop Jobs –  Support for: •  Java MapReduce •  Streaming MapReduce •  Pig, Hive, Sqoop, Distcp •  Any ol’ Java or shell script program
  • 55© Copyright 2014 Pivotal. All rights reserved. Real-time Stream Processing •  Storm –  Open-source project which runs streams of data from sources, called spouts, through a series of execution agents called bolts –  Scalable and fault-tolerant, with guaranteed processing of data –  Benchmarks of over a million tuples processed per second per node
  • 56© Copyright 2014 Pivotal. All rights reserved. Distributed Application Coordination •  ZooKeeper –  An effort to develop and maintain an open-source server which enables highly reliable distributed coordination –  Designed to be simple, replicated, ordered, and fast –  Provides configuration management, distributed synchronization, and group services for applications
  • 57© Copyright 2014 Pivotal. All rights reserved. ZooKeeper Architecture
  • 58© Copyright 2014 Pivotal. All rights reserved. Hadoop Streaming •  Can define the Mapper and Reducer using Unix text filters –  Typically use grep, sed, python, or perl scripts •  Format for input and output is: key \t value \n •  Allows for easy debugging and experimentation •  Slower than Java programs •  bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingMapper.sh -reducer streamingReducer.sh –  Mapper: /bin/sed -e 's| |\n|g' | /bin/grep . –  Reducer: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'
  • 59© Copyright 2014 Pivotal. All rights reserved. Hadoop Streaming Architecture (diagram: the JobTracker (master) launches map and reduce tasks on TaskTrackers (slaves); each task feeds the mapper/reducer executable its input from HDFS via STDIN and collects its key \t value output via STDOUT) http://hadoop.apache.org/docs/stable/streaming.html
  • 60© Copyright 2014 Pivotal. All rights reserved. SQL on Hadoop •  Apache Drill •  Cloudera Impala •  Hive Stinger •  Pivotal HAWQ •  MPP execution of SQL queries against HDFS data
  • 61© Copyright 2014 Pivotal. All rights reserved. That’s a lot of projects •  I am likely missing several (Sorry, guys!) •  Each cropped up to solve a limitation of Hadoop Core •  Know your ecosystem •  Pick the right tool for the right job
  • 62© Copyright 2014 Pivotal. All rights reserved. Sample Architecture (diagram: a webserver, sales, and call-center sources feed data into HDFS via Flume agents and SQL loads; MapReduce, Pig, HBase, and Storm process it; Oozie coordinates the workflow; results serve a website)
  • 63© Copyright 2014 Pivotal. All rights reserved. MapReduce Primer
  • 64© Copyright 2014 Pivotal. All rights reserved. MapReduce Paradigm •  Data processing system with two key phases •  Map –  Perform a map function on input key/value pairs to generate intermediate key/value pairs •  Reduce –  Perform a reduce function on intermediate key/value groups to generate output key/value pairs •  Groups created by sorting map output
  • Word count data flow (Map Tasks 0–2, Reduce Tasks 0–1)
    Map Input: (0, "hadoop is fun") (52, "I love hadoop") (104, "Pig is more fun")
    Map Output: ("hadoop", 1) ("is", 1) ("fun", 1) | ("I", 1) ("love", 1) ("hadoop", 1) | ("Pig", 1) ("is", 1) ("more", 1) ("fun", 1)
    SHUFFLE AND SORT
    Reducer Input Groups: ("hadoop", {1,1}) ("is", {1,1}) ("fun", {1,1}) ("love", {1}) ("I", {1}) | ("Pig", {1}) ("more", {1})
    Reducer Output: ("hadoop", 2) ("fun", 2) ("love", 1) ("I", 1) ("is", 2) | ("Pig", 1) ("more", 1)
  • 66© Copyright 2014 Pivotal. All rights reserved. Hadoop MapReduce Components •  Map Phase –  Input Format –  Record Reader –  Mapper –  Combiner –  Partitioner •  Reduce Phase –  Shuffle –  Sort –  Reducer –  Output Format –  Record Writer
  • 67© Copyright 2014 Pivotal. All rights reserved. Writable Interfaces

    public interface Writable {
      void write(DataOutput out);
      void readFields(DataInput in);
    }

    public interface WritableComparable<T> extends Writable, Comparable<T> {
    }

•  BooleanWritable •  BytesWritable •  ByteWritable •  DoubleWritable •  FloatWritable •  IntWritable •  LongWritable •  NullWritable •  Text
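As an illustration of the interfaces above, here is a small, hypothetical WritableComparable implementation for a 2-D point key. It is only a sketch; note that in the real org.apache.hadoop.io interfaces, write() and readFields() are declared to throw IOException.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class PointWritable implements WritableComparable<PointWritable> {
      private int x;
      private int y;

      @Override
      public void write(DataOutput out) throws IOException {  // serialize fields in a fixed order
        out.writeInt(x);
        out.writeInt(y);
      }

      @Override
      public void readFields(DataInput in) throws IOException {  // deserialize in the same order
        x = in.readInt();
        y = in.readInt();
      }

      @Override
      public int compareTo(PointWritable other) {  // defines the sort order used in the shuffle
        int cmp = Integer.compare(x, other.x);
        return cmp != 0 ? cmp : Integer.compare(y, other.y);
      }
    }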
  • 68© Copyright 2014 Pivotal. All rights reserved. InputFormat

    public abstract class InputFormat<K, V> {
      public abstract List<InputSplit> getSplits(JobContext context);
      public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context);
    }
  • 69© Copyright 2014 Pivotal. All rights reserved. RecordReader

    public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
      public abstract void initialize(InputSplit split, TaskAttemptContext context);
      public abstract boolean nextKeyValue();
      public abstract KEYIN getCurrentKey();
      public abstract VALUEIN getCurrentValue();
      public abstract float getProgress();
      public abstract void close();
    }
  • 70© Copyright 2014 Pivotal. All rights reserved. Mapper

    public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      protected void setup(Context context) { /* NOTHING */ }
      protected void cleanup(Context context) { /* NOTHING */ }

      protected void map(KEYIN key, VALUEIN value, Context context) {
        context.write((KEYOUT) key, (VALUEOUT) value);
      }

      public void run(Context context) {
        setup(context);
        while (context.nextKeyValue())
          map(context.getCurrentKey(), context.getCurrentValue(), context);
        cleanup(context);
      }
    }
  • 71© Copyright 2014 Pivotal. All rights reserved. Partitioner

    public abstract class Partitioner<KEY, VALUE> {
      public abstract int getPartition(KEY key, VALUE value, int numPartitions);
    }

•  Default HashPartitioner uses key’s hashCode() % numPartitions
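For contrast with the default HashPartitioner, here is a hypothetical custom partitioner that routes keys by their first character so that all words starting with the same letter go to the same reducer; the class name and key/value types are assumptions for illustration.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (key.getLength() == 0) {
          return 0;  // send empty keys to the first partition
        }
        // Mask off the sign bit (as HashPartitioner does) to keep the index non-negative.
        return (Character.toLowerCase(key.charAt(0)) & Integer.MAX_VALUE) % numPartitions;
      }
    }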
  • 72© Copyright 2014 Pivotal. All rights reserved. Reducer

    public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      protected void setup(Context context) { /* NOTHING */ }
      protected void cleanup(Context context) { /* NOTHING */ }

      protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) {
        for (VALUEIN value : values)
          context.write((KEYOUT) key, (VALUEOUT) value);
      }

      public void run(Context context) {
        setup(context);
        while (context.nextKey())
          reduce(context.getCurrentKey(), context.getValues(), context);
        cleanup(context);
      }
    }
  • 73© Copyright 2014 Pivotal. All rights reserved. OutputFormat

    public abstract class OutputFormat<K, V> {
      public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context);
      public abstract void checkOutputSpecs(JobContext context);
      public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context);
    }
  • 74© Copyright 2014 Pivotal. All rights reserved. RecordWriter

    public abstract class RecordWriter<K, V> {
      public abstract void write(K key, V value);
      public abstract void close(TaskAttemptContext context);
    }
  • 75© Copyright 2014 Pivotal. All rights reserved. Some M/R Concepts / knobs •  Configuration –  {hdfs,yarn,mapred}-default.xml -- default config (contains both service & client config) –  {hdfs,yarn,mapred}-site.xml -- Service config used for cluster-specific over-rides –  {hdfs,yarn,mapred}-client.xml -- Client-specific config •  Input/Output Formats –  TextInputFormat, KeyValueTextInputFormat, NLineInputFormat, SequenceFileInputFormat –  Pluggable input/output formats give Jobs the ability to read/write data in different formats –  Major functions •  getSplits •  RecordReader •  Schedulers –  Pluggable resource scheduler used by the Resource Manager –  Default, Capacity Scheduler & Fair Scheduler •  Combiner –  Combines individual map output before sending it to the reducer –  Lowers the volume of intermediate data •  Partitioner –  Pluggable class to partition the map output among the reducers
  • 76© Copyright 2014 Pivotal. All rights reserved. Some M/R knobs •  Compression –  Enable compression of Map/Reduce output –  Gzip, lzo, bz2 codecs available with the framework •  Counters –  Ability to keep track of various job statistics, e.g. num bytes read, written –  Available for each task and also aggregated per job –  A job can write its own custom counters •  Speculative Execution –  Runs duplicate attempts of slow tasks to work around slow or failing hardware •  Distributed cache –  Ability to make job-specific data available to each task •  Tool –  M/R application helper classes that let a job accept generic options (see the driver skeleton below), e.g. –  -conf <configuration file> specify an application configuration file –  -D <property=value> use value for given property –  -fs <local|namenode:port> specify a namenode –  -jt <local|jobtracker:port> specify a job tracker –  -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster –  -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath –  -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines
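The generic options above are handled by the Tool and ToolRunner helpers mentioned in the slide. A minimal, hypothetical driver skeleton might look like this; ToolRunner strips the -conf/-D/-files/-libjars style options into the Configuration before run() sees the remaining application arguments. The class name is an assumption for illustration.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class MyJobDriver extends Configured implements Tool {  // hypothetical driver class
      @Override
      public int run(String[] args) throws Exception {
        Configuration conf = getConf();  // already contains the generic-option overrides
        // ... build and submit the MapReduce Job here using conf and args ...
        return 0;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new MyJobDriver(), args));
      }
    }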
  • 77© Copyright 2014 Pivotal. All rights reserved. Word Count Example
  • 78© Copyright 2014 Pivotal. All rights reserved. Problem •  Count the number of times each word is used in a body of text •  Uses TextInputFormat and TextOutputFormat

    map(byte_offset, line)
      foreach word in line
        emit(word, 1)

    reduce(word, counts)
      sum = 0
      foreach count in counts
        sum += count
      emit(word, sum)
  • 79© Copyright 2014 Pivotal. All rights reserved. Word Count Example
  • 80© Copyright 2014 Pivotal. All rights reserved. Mapper Code

    public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, Context context) {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          context.write(word, ONE);
        }
      }
    }
  • 81© Copyright 2014 Pivotal. All rights reserved. Shuffle and Sort (diagram: Mappers 0–3, Reducers 0–3, partitions P0–P3) 1. Each mapper writes its output to a single logically partitioned file 2. Reducers copy their parts 3. Each reducer merges its partitions, sorting by key
  • 82© Copyright 2014 Pivotal. All rights reserved. Reducer Code

    public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private IntWritable outvalue = new IntWritable();
      private int sum = 0;

      public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        sum = 0;
        for (IntWritable val : values) {
          sum += val.get();
        }
        outvalue.set(sum);
        context.write(key, outvalue);
      }
    }
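The deck shows the mapper and reducer but not the driver that wires them together. For completeness, here is a minimal sketch of a word-count driver using the 2.x Job.getInstance API; the class name is hypothetical, input/output paths come from the command line, and the combiner reuses IntSumReducer since its input and output types match.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {  // hypothetical driver; not part of the original deck
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordMapper.class);      // from the Mapper Code slide
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation of map output
        job.setReducerClass(IntSumReducer.class);  // from the Reducer Code slide

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not already exist

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }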
  • 83© Copyright 2014 Pivotal. All rights reserved. So what’s so hard about it? (diagram: a tiny box labeled “MapReduce” inside a much larger box labeled “all the problems you’ll ever have, ever”)
  • 84© Copyright 2014 Pivotal. All rights reserved. So what’s so hard about it? •  MapReduce is a limitation •  Entirely different way of thinking •  Simple processing operations such as joins are not so easy when expressed in MapReduce •  Proper implementation is not so easy •  Lots of configuration and implementation details for optimal performance –  Number of reduce tasks, data skew, JVM size, garbage collection
  • 85© Copyright 2014 Pivotal. All rights reserved. So what does this mean for you? •  Hadoop is written primarily in Java •  Components are extendable and configurable •  Custom I/O through Input and Output Formats –  Parse custom data formats –  Read and write using external systems •  Higher-level tools enable rapid development of big data analysis
  • 86© Copyright 2014 Pivotal. All rights reserved. Resources, Wrap-up, etc. •  http://hadoop.apache.org •  Very supportive community •  Plenty of resources available to learn more –  Blogs –  Email lists –  Books –  Shameless Plug -- MapReduce Design Patterns
  • 87© Copyright 2014 Pivotal. All rights reserved. Getting Started •  Pivotal HD Single-Node VM and Community Edition –  http://gopivotal.com/pivotal-products/data/pivotal-hd •  For the brave and bold -- Roll-your-own! –  http://hadoop.apache.org/docs/current
  • 88© Copyright 2014 Pivotal. All rights reserved. Acknowledgements •  Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation •  Cloudera Impala is a trademark of Cloudera •  Parquet is copyright Twitter, Cloudera, and other contributors •  Storm is licensed under the Eclipse Public License
  • 89© Copyright 2014 Pivotal. All rights reserved. •  Talk to us on Twitter: @mewzherder (Tamao, not me) •  Sign up for more Hadoop –  http://bit.ly/POSH0018 •  Pivotal Education –  http://www.gopivotal.com/training Learn More. Stay Connected.
  • 90© Copyright 2014 Pivotal. All rights reserved. Questions ??