Apache hadoop: POSH Meetup Palo Alto, CA April 2014

A presentation on Apache Hadoop in Palo Alto, CA for POSH (http://www.meetup.com/Pivotal-Open-Source-Hub/)

Published in: Data & Analytics, Technology

Transcript of "Apache hadoop: POSH Meetup Palo Alto, CA April 2014"

  1. Intro to Hadoop: Hype or Reality – you decide. Kevin Crocker, Consulting Instructor, Pivotal Academy – kcrocker@gopivotal.com – Pivotal Meet-up, March 19, 2014
  2. Why is this Meet-up necessary? •  What is the future of enterprise data architecture? –  The explosion of data –  Volume, Variety, Velocity –  Overruns traditional data stores –  What is the business value of collecting all this data?
  3. Volume •  At a recent data conference, one participant told the audience that they collected 7 PB of data a day – and generated another 7 PB of data analytics •  That’s 63 racks! A day! X 2 •  What do we even call that amount of data? –  Data Warehouse(s), Data Store(s) –  New Term: Data Lake
  4. Variety •  At the same data conference, another presenter participated in a study using wearable medical technology to monitor health –  Collected 1 million readings a day = 12 readings a second –  When was the last time you had YOUR blood pressure checked? •  Toronto – so many sensors they can track millions of cell phones over 400 square miles – 24x7
  5. Velocity •  Ingesting this amount of data is difficult •  Analyzing this amount of data in traditional ways is also difficult –  A client recently told me that it used to take 3 weeks for them to analyze the data from their sensors; now they do it in 3 hours
  6. Business Value •  Wall Street Journal – those businesses in Toronto pay to get summary reports of all that data and then gear their marketing campaigns to drive new revenue
  7. The Data Lake Dream, Forbes, 01/14/2014 •  In an article published in Forbes, the author mentions the term Data Lake and the technology that addresses the problem of big data => Hadoop •  Four levels of Hadoop Maturity –  Life Before Hadoop -> Hadoop is Introduced -> Growing the Data Lake -> Data Lake and Application Cloud
  8. So – Let’s talk about Hadoop •  Hadoop Overview –  Core Elements: HDFS and MapReduce –  Ecosystem •  HDFS Architecture •  Hadoop MapReduce •  Hadoop Ecosystem •  MapReduce Primer •  Buckle up!
  9. Hadoop Overview
  10. Hadoop Core •  Based on two Google papers in 2003/4 – Google File System and MapReduce •  Spun off from Nutch, the open-source web-search project, because of the need to store its data •  Open-source Apache project out of Yahoo! in January 2006 •  Distributed fault-tolerant data storage (distribution and replication of resources) and distributed batch processing (not for random reads/writes, or updates) •  Provides linear scalability on commodity hardware •  Adopted by many: –  Amazon, AOL, eBay, Facebook, Foursquare, Google, IBM, Netflix, Twitter, Yahoo!, and many, many more http://wiki.apache.org/hadoop/PoweredBy •  Hadoop uses data redundancy rather than backup strategies
  11. Hadoop Overview •  Consists of: –  Key sub-projects •  Hadoop Common: Common utilities/tools for all Hadoop components/sub-projects •  HDFS: A reliable, high-bandwidth, distributed file system •  Map/Reduce: A programming framework to process large datasets •  YARN –  Other key Apache projects in the Hadoop ecosystem •  Avro: A data serialization system •  HBase/Cassandra: Scalable, distributed NoSQL databases that support structured data storage for large tables •  Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying •  Pig: A high-level data-flow language and execution framework for parallel computation •  ZooKeeper: A high-performance coordination service for distributed applications •  Latest versions of Hadoop: –  Stable and widely used – V1 => 1.2.1, V2 => 2.2.0
  12. Why? •  Bottom line: –  Flexible –  Scalable –  Inexpensive
  13. Overview •  Great at –  Reliable storage for multi-petabyte data sets –  Batch queries and analytics –  Complex hierarchical data structures with changing schemas, unstructured and structured data •  Not so great at –  Changes to files (can’t do it…) – not OLTP –  Low-latency responses –  Analyst usability •  This is less of a concern now due to higher-level languages
  14. Data Structure •  Bytes! And more Bytes! (Peta) •  No more ETL necessary??? •  Store data now, process later •  Structure (schema) on read –  Built-in support for common data types and formats –  Extendable –  Flexible
  15. Versioning •  Versions 0.20.x, 0.21.x, 0.22.x, 0.23.x, 1.x.x –  Two main MR packages: •  org.apache.hadoop.mapred (deprecated) •  org.apache.hadoop.mapreduce (new hotness) •  Version 2.2.0, GA Oct 2013 –  NameNode HA –  YARN – Next Gen MapReduce –  HDFS Federation, Snapshots
  16. HDFS Architecture
  17. HDFS Architecture (diagram)
  18. HDFS Architecture (Master/Worker) •  HDFS Master: “Namenode” –  Manages the filesystem namespace –  Controls read/write access to files –  Serves open/close/rename file requests from clients –  Manages block replication (rack-aware block placement, auto re-replication) –  Checkpoints namespace and journals namespace changes for reliability •  HDFS Workers: “Datanodes” –  Serve read/write requests from clients –  Perform replication tasks upon instruction by Namenode –  Periodically validate the data checksum •  HDFS Client –  Interface available in Java, C, and command line –  Client computes and validates checksum stored by Datanode for data integrity check (if block is corrupt, then other replica is accessed)
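     The Java client interface mentioned above is the usual programmatic entry point. A minimal sketch, assuming a cluster reachable via the fs.defaultFS setting on the classpath (the path and class name are illustrative):

       import java.io.BufferedReader;
       import java.io.InputStreamReader;
       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.fs.FSDataOutputStream;
       import org.apache.hadoop.fs.FileSystem;
       import org.apache.hadoop.fs.Path;

       public class HdfsClientSketch {
         public static void main(String[] args) throws Exception {
           FileSystem fs = FileSystem.get(new Configuration());
           Path file = new Path("/user/demo/hello.txt");  // hypothetical path
           // Write: the client asks the NameNode where to put blocks,
           // then streams the bytes to DataNodes
           try (FSDataOutputStream out = fs.create(file, true)) {
             out.writeBytes("hadoop is fun\n");
           }
           // Read: the NameNode returns block locations; data comes from DataNodes
           try (BufferedReader in = new BufferedReader(
               new InputStreamReader(fs.open(file)))) {
             System.out.println(in.readLine());
           }
         }
       }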
  19. Hadoop Distributed File System – Data Model: •  Data is organized into files and directories •  Files are divided into uniformly-sized blocks and distributed across cluster nodes •  Blocks are replicated to handle hardware failure •  Filesystem keeps checksums of data for corruption detection and recovery •  Read requests are always served from closest replica •  Not strictly POSIX-compliant
  20. Hadoop Distributed File System •  Distributed, Fault-Tolerant & Scalable (petabyte) File System •  Designed to run on commodity hardware, where hardware failure is the norm (block-level replication, akin to RAID-1) •  High throughput for streaming/sequential data access, as opposed to low latency for random I/O •  Tuned for a smaller number of large data files •  Simple coherency model (write once, read multiple times); appending data to a file has been supported since 0.19 •  Support for scalable data processing – exposes metadata such as the number of block replicas and their locations, for scheduling computations closer to data •  Portability across heterogeneous HW & SW platforms – file system written in Java •  High Availability and Namespace federation support (2.0.x-alpha)
  21. HDFS Overview •  Hierarchical UNIX-like file system for data storage –  sort of (files, folders, permissions, users, groups) … but it is a virtual file system •  Splitting of large files into blocks •  Distribution and replication of blocks to nodes •  Two key services –  Master NameNode –  Many DataNodes •  Checkpoint Node (Secondary NameNode)
  22. NameNode •  Single master service for HDFS •  Single point of failure (HDFS 1.x; not 2.x) •  Stores file-to-block-to-location mappings in the namespace •  All transactions are logged to disk •  NameNode startup reads namespace image and logs
  23. Checkpoint Node (Secondary NN) •  Performs checkpoints of the NameNode’s namespace and logs •  Not a hot backup! 1.  Loads up namespace 2.  Reads log transactions to modify namespace 3.  Saves namespace as a checkpoint
  24. DataNode •  Stores blocks on local disk •  Sends frequent heartbeats to NameNode •  Sends block reports to NameNode (all the block IDs it has, checksums, etc) •  Clients connect to DataNode for I/O
  25. How HDFS Works – Writes (diagram: Client, NameNode, DataNodes A–D, blocks A1–A4) 1. Client contacts NameNode to write data 2. NameNode says write it to these nodes 3. Client sequentially writes blocks to DataNodes
  26. How HDFS Works – Writes (diagram: blocks A1–A4 replicated across DataNodes A–D) DataNodes replicate data blocks, orchestrated by the NameNode
  27. How HDFS Works – Reads (diagram: Client, NameNode, DataNodes A–D) 1. Client contacts NameNode to read data 2. NameNode says you can find it here 3. Client sequentially reads blocks from DataNodes
  28. How HDFS Works – Failure (diagram) Client connects to another node serving that block
  29. Block Replication •  Default of three replicas •  Rack-aware system –  One block on same rack –  One block on same rack, different host –  One block on another rack •  Automatic re-copy by NameNode, as needed (diagram: Rack 1 and Rack 2, each holding DataNodes)
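     Replication is visible from the same Java client API. A hedged sketch of inspecting which DataNodes hold a file's blocks, and requesting a different replica count (the file path is illustrative):

       import java.util.Arrays;
       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.fs.BlockLocation;
       import org.apache.hadoop.fs.FileStatus;
       import org.apache.hadoop.fs.FileSystem;
       import org.apache.hadoop.fs.Path;

       public class ReplicationSketch {
         public static void main(String[] args) throws Exception {
           FileSystem fs = FileSystem.get(new Configuration());
           Path file = new Path("/user/demo/hello.txt");  // hypothetical file
           FileStatus status = fs.getFileStatus(file);
           // One BlockLocation per block; hosts are the DataNodes holding replicas
           for (BlockLocation block :
               fs.getFileBlockLocations(status, 0, status.getLen())) {
             System.out.println(Arrays.toString(block.getHosts()));
           }
           fs.setReplication(file, (short) 2);  // NameNode re-replicates as needed
         }
       }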
  30. HDFS 2.0 Features •  NameNode High-Availability (HA) –  Two redundant NameNodes in active/passive configuration –  Manual or automated failover •  NameNode Federation –  Multiple independent NameNodes using the same collection of DataNodes
  31. Hadoop MapReduce
  32. Map-Reduce Programming Model •  Programming model processing lists of key/value pairs •  Map function: processes input key/value pairs and produces a set of intermediate key/value pairs •  Reduce function: merges all intermediate values associated with the same intermediate key and produces output key/value pairs (diagram: input (k1, v1) → Map → intermediate output List(k2, v2) → sort or group by k2 → (k2, List(v2)) → Reduce → output (k2, List(v3)))
  33. Application Writer Specifies: •  Map and Reduce classes •  Input data on HDFS •  Input/Output format classes (optional) Workflow: •  Input phase generates a number of logical FileSplits from input files – one Map task is created per logical file split •  Each Map task loads the Map class and executes the map function to transform input kv-pairs into a new set of kv-pairs –  A record reader class, supplied as part of the InputFormat, reads an input record as a k-v pair –  One invocation of the map function per k-v pair from the associated input split •  Map output keys are stored on local disk in sorted partitions, one partition per reduce task •  Each Reduce task fetches map output (from its associated partition) as soon as a map task finishes its processing •  Map outputs are merged –  One invocation of the reduce function per distinct key and its associated list of values •  Output k-v pairs are stored on HDFS, one file per reduce task •  Framework handles task scheduling and recovery (diagram: Parallel Execution Model for Map-Reduce – input HDFS file → input splits 0–2 → Map 0–2 → sorted partitions → shuffle → merge & sort → Reduce 0–1 → output part-0/part-1)
  34. Hadoop MapReduce 1.x •  Moves the code to the data •  JobTracker –  Master service to monitor jobs •  TaskTracker –  Multiple services to run tasks in parallel –  Same physical machine as a DataNode •  A job contains many tasks (one data block equals one task) •  A task contains one or more task attempts (success = good; failed task attempts are given to another TaskTracker for processing: 4 failed task attempts = one failed job)
  35. JobTracker •  Monitors job and task progress •  Issues task attempts to TaskTrackers •  Re-tries failed task attempts •  Four failed attempts = one failed job •  Schedules jobs in FIFO order –  Fair Scheduler available as an alternative •  Single point of failure for MapReduce
  36. TaskTrackers •  Runs on same node as DataNode service •  Sends heartbeats and task reports to JobTracker •  Configurable number of map and reduce slots •  Runs map and reduce task attempts –  Separate JVM!
  37. Exploiting Data Locality •  JobTracker will schedule a task on a TaskTracker that is local to the block –  3 options, because 3 replicas! •  If TaskTracker is busy, selects a TaskTracker on the same rack –  Many options! •  If still busy, chooses an available TaskTracker at random – rare!
  38. YARN (aka MapReduce 2) •  Abstract framework for distributed application development •  Split functionality of JobTracker into two components –  ResourceManager –  ApplicationMaster •  TaskTracker becomes NodeManager –  Containers instead of map and reduce slots •  Configurable amount of memory per NodeManager
  39. How MapReduce Works (diagram: Client, JobTracker, TaskTrackers A–D co-located with DataNodes A–D) 1. Client submits job to JobTracker 2. JobTracker submits tasks to TaskTrackers 3. Job output is written to DataNodes w/replication 4. JobTracker reports metrics back to client
  40. How MapReduce Works – Failure (diagram) JobTracker assigns task to different node
  41. MapReduce 2.x on YARN •  MapReduce API has not changed –  Rebuild required to upgrade from 1.x to 2.x •  MapReduce History Server to store… history
  42. YARN – Architecture •  Client –  Submits jobs/applications •  Resource Manager –  Schedules resources •  AppMaster –  Manages/monitors the lifecycle of the M/R job •  Node Manager –  Manages/monitors task lifecycle •  Container –  Task JVM –  No distinction between map and reduce tasks
  43. YARN – Map/Reduce (diagram)
  44. Hadoop Ecosystem
  45. Hadoop Ecosystem •  Core Technologies –  Hadoop Distributed File System –  Hadoop MapReduce •  Many other tools… –  Which I will be describing… now
  46. Moving Data •  Sqoop –  Moving data between RDBMS and HDFS –  Say, migrating MySQL tables to HDFS •  Flume –  Streams event data from sources to sinks –  Say, weblogs from multiple servers into HDFS
  47. Flume Architecture (diagram)
  48. Higher Level APIs •  Pig –  Data-flow language – aptly named PigLatin – to generate one or more MapReduce jobs against data stored locally or in HDFS •  Hive –  Data warehousing solution, allowing users to write SQL-like queries to generate a series of MapReduce jobs against data stored in HDFS
  49. Pig Word Count
      A = LOAD '$input';
      B = FOREACH A GENERATE FLATTEN(TOKENIZE($0)) AS word;
      C = GROUP B BY word;
      D = FOREACH C GENERATE group AS word, COUNT(B);
      STORE D INTO '$output';
  50. Key/Value Stores •  HBase •  Accumulo •  Implementations of Google’s Bigtable for HDFS •  Provide random, real-time access to big data •  Support updates and deletes of key/value pairs
  51. HBase Architecture (diagram: Client and ZooKeeper talk to the Master; RegionServers host Regions, each containing Stores with a MemStore and StoreFiles, persisted on HDFS)
  52. Data Structure •  Avro –  Data serialization system designed for the Hadoop ecosystem –  Expressed as JSON •  Parquet –  Compressed, efficient columnar storage for Hadoop and other systems
  53. Scalable Machine Learning •  Mahout –  Library for scalable machine learning written in Java –  Very robust examples! –  Classification, Clustering, Pattern Mining, Collaborative Filtering, and much more
  54. Workflow Management •  Oozie –  Scheduling system for Hadoop Jobs –  Support for: •  Java MapReduce •  Streaming MapReduce •  Pig, Hive, Sqoop, Distcp •  Any ol’ Java or shell script program
  55. Real-time Stream Processing •  Storm –  Open-source project which streams data from sources, called spouts, to a series of execution agents called bolts –  Scalable and fault-tolerant, with guaranteed processing of data –  Benchmarks of over a million tuples processed per second per node
  56. Distributed Application Coordination •  ZooKeeper –  An effort to develop and maintain an open-source server which enables highly reliable distributed coordination –  Designed to be simple, replicated, ordered, and fast –  Provides configuration management, distributed synchronization, and group services for applications
  57. ZooKeeper Architecture (diagram)
  58. Hadoop Streaming •  Can define Mapper and Reducer using Unix text filters –  Typically use grep, sed, python, or perl scripts •  Format for input and output is: key \t value \n •  Allows for easy debugging and experimentation •  Slower than Java programs •  bin/hadoop jar hadoop-streaming.jar -input in-dir -output out-dir -mapper streamingMapper.sh -reducer streamingReducer.sh –  Mapper: /bin/sed -e 's| |\n|g' | /bin/grep . –  Reducer: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'
  59. Hadoop Streaming Architecture (diagram: JobTracker (master) dispatches to TaskTrackers (slaves); a map task pipes input from an HDFS file to the mapper executable via STDIN and reads its STDOUT; a reduce task does the same with the reducer executable, writing key \t value output to HDFS) http://hadoop.apache.org/docs/stable/streaming.html
  60. SQL on Hadoop •  Apache Drill •  Cloudera Impala •  Hive Stinger •  Pivotal HAWQ •  MPP execution of SQL queries against HDFS data
  61. That’s a lot of projects •  I am likely missing several (Sorry, guys!) •  Each cropped up to solve a limitation of Hadoop Core •  Know your ecosystem •  Pick the right tool for the right job
  62. Sample Architecture (diagram: Website/Webserver, Sales, and Call Center sources feed Flume Agents and SQL loads into HDFS; MapReduce, Pig, HBase, and Storm process the data, with Oozie coordinating workflows)
  63. MapReduce Primer
  64. MapReduce Paradigm •  Data processing system with two key phases •  Map –  Perform a map function on input key/value pairs to generate intermediate key/value pairs •  Reduce –  Perform a reduce function on intermediate key/value groups to generate output key/value pairs •  Groups created by sorting map output
  65. (diagram: word count through Map, Shuffle and Sort, and Reduce)
      Map Input: (0, "hadoop is fun") | (52, "I love hadoop") | (104, "Pig is more fun")
      Map Output: ("hadoop", 1) ("is", 1) ("fun", 1) | ("I", 1) ("love", 1) ("hadoop", 1) | ("Pig", 1) ("is", 1) ("more", 1) ("fun", 1)
      Shuffle and Sort
      Reducer Input Groups – Reduce Task 0: ("hadoop", {1,1}) ("fun", {1,1}) ("love", {1}) ("I", {1}); Reduce Task 1: ("is", {1,1}) ("Pig", {1}) ("more", {1})
      Reducer Output – Reduce Task 0: ("hadoop", 2) ("fun", 2) ("love", 1) ("I", 1); Reduce Task 1: ("is", 2) ("Pig", 1) ("more", 1)
  66. Hadoop MapReduce Components •  Map Phase –  Input Format –  Record Reader –  Mapper –  Combiner –  Partitioner •  Reduce Phase –  Shuffle –  Sort –  Reducer –  Output Format –  Record Writer
  67. Writable Interfaces
      public interface Writable {
        void write(DataOutput out);
        void readFields(DataInput in);
      }
      public interface WritableComparable<T> extends Writable, Comparable<T> {
      }
      •  BooleanWritable •  BytesWritable •  ByteWritable •  DoubleWritable •  FloatWritable •  IntWritable •  LongWritable •  NullWritable •  Text
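     Custom key types implement WritableComparable so the framework can serialize them and sort map output. A minimal sketch (the PairKey type is invented for illustration; the real interfaces declare throws IOException):

       import java.io.DataInput;
       import java.io.DataOutput;
       import java.io.IOException;
       import org.apache.hadoop.io.WritableComparable;

       public class PairKey implements WritableComparable<PairKey> {
         private long id;
         private String name = "";

         public void write(DataOutput out) throws IOException {
           out.writeLong(id);   // serialize fields in a fixed order
           out.writeUTF(name);
         }

         public void readFields(DataInput in) throws IOException {
           id = in.readLong();  // deserialize in the same order
           name = in.readUTF();
         }

         public int compareTo(PairKey other) {  // drives the shuffle sort
           int cmp = Long.compare(id, other.id);
           return cmp != 0 ? cmp : name.compareTo(other.name);
         }
       }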
  68. InputFormat
      public abstract class InputFormat<K, V> {
        public abstract List<InputSplit> getSplits(JobContext context);
        public abstract RecordReader<K, V> createRecordReader(InputSplit split, TaskAttemptContext context);
      }
  69. RecordReader
      public abstract class RecordReader<KEYIN, VALUEIN> implements Closeable {
        public abstract void initialize(InputSplit split, TaskAttemptContext context);
        public abstract boolean nextKeyValue();
        public abstract KEYIN getCurrentKey();
        public abstract VALUEIN getCurrentValue();
        public abstract float getProgress();
        public abstract void close();
      }
  70. Mapper
      public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        protected void setup(Context context) { /* NOTHING */ }
        protected void cleanup(Context context) { /* NOTHING */ }
        protected void map(KEYIN key, VALUEIN value, Context context) {
          context.write((KEYOUT) key, (VALUEOUT) value);
        }
        public void run(Context context) {
          setup(context);
          while (context.nextKeyValue())
            map(context.getCurrentKey(), context.getCurrentValue(), context);
          cleanup(context);
        }
      }
  71. Partitioner
      public abstract class Partitioner<KEY, VALUE> {
        public abstract int getPartition(KEY key, VALUE value, int numPartitions);
      }
      •  Default HashPartitioner uses key’s hashCode() % numPartitions
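     A custom Partitioner overrides where each map output key lands. An invented example that routes words to reducers by first letter (wired in via job.setPartitionerClass(FirstLetterPartitioner.class)):

       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Partitioner;

       public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
         @Override
         public int getPartition(Text key, IntWritable value, int numPartitions) {
           String word = key.toString();
           char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
           // char promotes to a non-negative int, so the modulo is safe
           return first % numPartitions;
         }
       }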
  72. Reducer
      public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        protected void setup(Context context) { /* NOTHING */ }
        protected void cleanup(Context context) { /* NOTHING */ }
        protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context) {
          for (VALUEIN value : values)
            context.write((KEYOUT) key, (VALUEOUT) value);
        }
        public void run(Context context) {
          setup(context);
          while (context.nextKey())
            reduce(context.getCurrentKey(), context.getValues(), context);
          cleanup(context);
        }
      }
  73. OutputFormat
      public abstract class OutputFormat<K, V> {
        public abstract RecordWriter<K, V> getRecordWriter(TaskAttemptContext context);
        public abstract void checkOutputSpecs(JobContext context);
        public abstract OutputCommitter getOutputCommitter(TaskAttemptContext context);
      }
  74. RecordWriter
      public abstract class RecordWriter<K, V> {
        public abstract void write(K key, V value);
        public abstract void close(TaskAttemptContext context);
      }
  75. Some M/R Concepts / knobs •  Configuration –  {hdfs,yarn,mapred}-default.xml – default config (contains both service & client config) –  {hdfs,yarn,mapred}-site.xml – service config used for cluster-specific over-rides –  {hdfs,yarn,mapred}-client.xml – client-specific config •  Input/Output Formats –  TextFileInputFormat, KeyValueTextFileInputFormat, NLineInputFormat, SequenceFileInputFormat –  Pluggable input/output formats provide the ability for jobs to read/write data in different formats –  Major functions: •  getSplits •  RecordReader •  Schedulers –  Pluggable resource scheduler used by the Resource Manager –  Default, Capacity Scheduler & Fair Scheduler •  Combiner –  Combines individual map output before sending to the reducer –  Lowers intermediate data •  Partitioner –  Pluggable class to partition the map output among a number of reducers
  76. Some M/R knobs •  Compression –  Enable compression of Map/Reduce output –  Gzip, lzo, bz2 codecs available with the framework •  Counters –  Ability to keep track of various job statistics, e.g. num bytes read, written –  Available for each task and also aggregated per job –  Job can write its own custom counters •  Speculative Execution –  Provides task recovery against hardware issues •  Distributed cache –  Ability to make job-specific data available to each task •  Tool – M/R application helper classes; support the ability for a job to accept generic options, e.g. –  -conf <configuration file> specify an application configuration file –  -D <property=value> use value for given property –  -fs <local|namenode:port> specify a namenode –  -jt <local|jobtracker:port> specify a job tracker –  -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster –  -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath –  -archives <comma separated list of archives> specify comma separated archives to be unarchived on the compute machines
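     The Tool interface is what makes those generic options work: ToolRunner strips them from argv and folds them into the Configuration before run() is called. A hedged sketch (job wiring elided; the class name is invented):

       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.conf.Configured;
       import org.apache.hadoop.util.Tool;
       import org.apache.hadoop.util.ToolRunner;

       public class WordCountTool extends Configured implements Tool {
         @Override
         public int run(String[] args) throws Exception {
           Configuration conf = getConf();  // already includes -D/-conf/-files overrides
           // ... build and submit a Job from conf here, as in the driver sketch below ...
           return 0;
         }

         public static void main(String[] args) throws Exception {
           // e.g. hadoop jar wc.jar WordCountTool -D mapreduce.job.reduces=4 in out
           System.exit(ToolRunner.run(new WordCountTool(), args));
         }
       }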
  77. Word Count Example
  78. Problem •  Count the number of times each word is used in a body of text •  Uses TextInputFormat and TextOutputFormat
      map(byte_offset, line)
        foreach word in line
          emit(word, 1)
      reduce(word, counts)
        sum = 0
        foreach count in counts
          sum += count
        emit(word, sum)
  79. Word Count Example (diagram)
  80. Mapper Code
      public class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private Text word = new Text();
        public void map(LongWritable key, Text value, Context context) {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, ONE);
          }
        }
      }
  81. Shuffle and Sort (diagram: four mappers each write one logically partitioned output file with partitions P0–P3; each reducer copies its partition from every mapper) 1. Mapper outputs to a single logically partitioned file 2. Reducers copy their parts 3. Reducer merges partitions, sorting by key
  82. Reducer Code
      public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable outvalue = new IntWritable();
        private int sum = 0;
        public void reduce(Text key, Iterable<IntWritable> values, Context context) {
          sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          outvalue.set(sum);
          context.write(key, outvalue);
        }
      }
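     The deck stops at the mapper and reducer; a driver class (hypothetical name, sketched here) wires them into a Job. Because addition is associative and commutative, IntSumReducer can also be registered as a combiner to pre-aggregate map output and shrink the shuffle:

       import org.apache.hadoop.conf.Configuration;
       import org.apache.hadoop.fs.Path;
       import org.apache.hadoop.io.IntWritable;
       import org.apache.hadoop.io.Text;
       import org.apache.hadoop.mapreduce.Job;
       import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
       import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

       public class WordCountDriver {
         public static void main(String[] args) throws Exception {
           Job job = Job.getInstance(new Configuration(), "word count");
           job.setJarByClass(WordCountDriver.class);
           job.setMapperClass(WordMapper.class);       // slide 80
           job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation
           job.setReducerClass(IntSumReducer.class);   // slide 82
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, new Path(args[1]));
           System.exit(job.waitForCompletion(true) ? 0 : 1);
         }
       }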
  83. So what’s so hard about it? (diagram: a tiny box labeled “MapReduce” inside a much larger box labeled “All the problems you’ll ever have ever”)
  84. So what’s so hard about it? •  MapReduce is a limitation •  Entirely different way of thinking •  Simple processing operations such as joins are not so easy when expressed in MapReduce •  Proper implementation is not so easy •  Lots of configuration and implementation details for optimal performance –  Number of reduce tasks, data skew, JVM size, garbage collection
  85. So what does this mean for you? •  Hadoop is written primarily in Java •  Components are extendable and configurable •  Custom I/O through Input and Output Formats –  Parse custom data formats –  Read and write using external systems •  Higher-level tools enable rapid development of big data analysis
  86. Resources, Wrap-up, etc. •  http://hadoop.apache.org •  Very supportive community •  Plenty of resources available to learn more –  Blogs –  Email lists –  Books –  Shameless Plug – MapReduce Design Patterns
  87. Getting Started •  Pivotal HD Single-Node VM and Community Edition –  http://gopivotal.com/pivotal-products/data/pivotal-hd •  For the brave and bold – Roll-your-own! –  http://hadoop.apache.org/docs/current
  88. Acknowledgements •  Apache Hadoop, the Hadoop elephant logo, HDFS, Accumulo, Avro, Drill, Flume, HBase, Hive, Mahout, Oozie, Pig, Sqoop, YARN, and ZooKeeper are trademarks of the Apache Software Foundation •  Cloudera Impala is a trademark of Cloudera •  Parquet is copyright Twitter, Cloudera, and other contributors •  Storm is licensed under the Eclipse Public License
  89. Learn More. Stay Connected. •  Talk to us on Twitter: @mewzherder (Tamao, not me) •  Sign up for more Hadoop –  http://bit.ly/POSH0018 •  Pivotal Education –  http://www.gopivotal.com/training
  90. Questions?
