Practical Problem Solving with Apache Hadoop & Pig
  1. Practical Problem Solving with Hadoop and Pig Milind Bhandarkar (milindb@yahoo-inc.com)
  2. Agenda • Introduction • Hadoop • Distributed File System • Map-Reduce • Pig • Q&A
  3. Agenda: Morning (8.30 - 12.00) • Introduction • Motivating Examples • Hadoop Distributed File System • Hadoop Map-Reduce • Q&A
  4. Agenda: Afternoon (1.30 - 5.00) • Performance Tuning • Hadoop Examples • Pig • Pig Latin Language & Examples • Architecture • Q&A
  5. About Me • Lead Yahoo! Grid Solutions Team since June 2005 • Contributor to Hadoop since January 2006 • Trained 1000+ Hadoop users at Yahoo! & elsewhere • 20+ years of experience in Parallel Programming
  6. Hadoop At Yahoo!
  7. Hadoop At Yahoo! (Some Statistics) • 25,000+ machines in 10+ clusters • Largest cluster is 3,000 machines • 3 Petabytes of data (compressed, unreplicated) • 700+ users • 10,000+ jobs/week
  8. Sample Applications • Data analysis is the inner loop of Web 2.0 • Data ⇒ Information ⇒ Value • Log processing: reporting, buzz • Search index • Machine learning: Spam filters • Competitive intelligence
  9. Prominent Hadoop Users • Yahoo! • Quantcast • A9.com • Joost • EHarmony • Last.fm • Facebook • Powerset • Fox Interactive Media • New York Times • IBM • Rackspace
  10. Yahoo! Search Assist
  11. Search Assist • Insight: Related concepts appear close together in text corpus • Input: Web pages • 1 Billion Pages, 10K bytes each • 10 TB of input data • Output: List(word, List(related words))
  12. Search Assist // Input: List(URL, Text) foreach URL in Input : Words = Tokenize(Text(URL)); foreach word in Tokens : Insert (word, Next(word, Tokens)) in Pairs; Insert (word, Previous(word, Tokens)) in Pairs; // Result: Pairs = List (word, RelatedWord) Group Pairs by word; // Result: List (word, List(RelatedWords)) foreach word in Pairs : Count RelatedWords in GroupedPairs; // Result: List (word, List(RelatedWords, count)) foreach word in CountedPairs : Sort Pairs(word, *) descending by count; choose Top 5 Pairs; // Result: List (word, Top5(RelatedWords))
  13. You Might Also Know
  14. You Might Also Know • Insight: You might also know Joe Smith if a lot of folks you know, know Joe Smith • if you don’t know Joe Smith already • Numbers: • 300 MM users • Average connections per user is 100
  15. You Might Also Know // Input: List(UserName, List(Connections)) foreach u in UserList : // 300 MM foreach x in Connections(u) : // 100 foreach y in Connections(x) : // 100 if (y not in Connections(u)) : Count(u, y)++; // 3 Trillion Iterations Sort (u,y) in descending order of Count(u,y); Choose Top 3 y; Store (u, {y0, y1, y2}) for serving;
  16. Performance • 101 random accesses for each user • Assume 1 ms per random access • 100 ms per user • 300 MM users • 300 days on a single machine
  17. MapReduce Paradigm
  18. Map & Reduce • Primitives in Lisp (& other functional languages), 1970s • Google Paper, 2004 • http://labs.google.com/papers/mapreduce.html
  19. Map Output_List = Map (Input_List) Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
  20. Reduce Output_Element = Reduce (Input_List) Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
  21. Parallelism • Map is inherently parallel • Each list element processed independently • Reduce is inherently sequential • Unless processing multiple lists • Grouping to produce multiple lists
  22. Search Assist Map // Input: http://hadoop.apache.org Pairs = Tokenize_And_Pair ( Text ( Input ) ) Output = { (apache, hadoop) (hadoop, mapreduce) (hadoop, streaming) (hadoop, pig) (apache, pig) (hadoop, DFS) (streaming, commandline) (hadoop, java) (DFS, namenode) (datanode, block) (replication, default) ... }
  23. Search Assist Reduce // Input: GroupedList (word, GroupedList(words)) CountedPairs = CountOccurrences (word, RelatedWords) Output = { (hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4) (hadoop, mapreduce, 9) ... }
  24. Issues with Large Data • Map Parallelism: Splitting input data • Shipping input data • Reduce Parallelism: • Grouping related data • Dealing with failures • Load imbalance
  25. Apache Hadoop • January 2006: Subproject of Lucene • January 2008: Top-level Apache project • Latest Version: 0.21 • Stable Version: 0.20.x • Major contributors: Yahoo!, Facebook, Powerset
  26. Apache Hadoop • Reliable, performant distributed file system • MapReduce programming framework • Sub-Projects: HBase, Hive, Pig, Zookeeper, Chukwa, Avro • Related Projects: Mahout, Hama, Cascading, Scribe, Cassandra, Dumbo, Hypertable, KosmosFS
  27. Problem: Bandwidth to Data • Scan 100TB datasets on 1000 node cluster • Remote storage @ 10MB/s = 165 mins • Local storage @ 50-200MB/s = 33-8 mins • Moving computation is more efficient than moving data • Need visibility into data placement
  28. Problem: Scaling Reliably • Failure is not an option, it’s a rule! • 1000 nodes, MTBF < 1 day • 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMs (16TB RAM) • Need fault tolerant store with reasonable availability guarantees • Handle hardware faults transparently
  29. Hadoop Goals • Scalable: Petabytes (10^15 Bytes) of data on thousands of nodes • Economical: Commodity components only • Reliable • Engineering reliability into every application is expensive
  30. Hadoop Distributed File System
  31. HDFS • Data is organized into files and directories • Files are divided into uniform sized blocks (default 64MB) and distributed across cluster nodes • HDFS exposes block placement so that computation can be migrated to data
  32. HDFS • Blocks are replicated (default 3) to handle hardware failure • Replication for performance and fault tolerance (Rack-Aware placement) • HDFS keeps checksums of data for corruption detection and recovery
  33. HDFS • Master-Worker Architecture • Single NameNode • Many (thousands of) DataNodes
  34. HDFS Master (NameNode) • Manages filesystem namespace • File metadata (i.e. “inode”) • Mapping inode to list of blocks + locations • Authorization & Authentication • Checkpoint & journal namespace changes
  35. Namenode • Mapping of datanode to list of blocks • Monitor datanode health • Replicate missing blocks • Keeps ALL namespace in memory • 60M objects (File/Block) in 16GB
  36. Datanodes • Handle block storage on multiple volumes & block integrity • Clients access the blocks directly from data nodes • Periodically send heartbeats and block reports to Namenode • Blocks are stored as underlying OS’s files
  37. HDFS Architecture
  38. Replication • A file’s replication factor can be changed dynamically (default 3) • Block placement is rack aware • Block under-replication & over-replication is detected by Namenode • Balancer application rebalances blocks to balance datanode utilization
  39. Accessing HDFS hadoop fs [-fs <local | file system URI>] [-conf <configuration file>] [-D <property=value>] [-ls <path>] [-lsr <path>] [-du <path>] [-dus <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm <src>] [-rmr <src>] [-put <localsrc> ... <dst>] [-copyFromLocal <localsrc> ... <dst>] [-moveFromLocal <localsrc> ... <dst>] [-get [-ignoreCrc] [-crc] <src> <localdst>] [-getmerge <src> <localdst> [addnl]] [-cat <src>] [-copyToLocal [-ignoreCrc] [-crc] <src> <localdst>] [-moveToLocal <src> <localdst>] [-mkdir <path>] [-report] [-setrep [-R] [-w] <rep> <path/file>] [-touchz <path>] [-test -[ezd] <path>] [-stat [format] <path>] [-tail [-f] <path>] [-text <path>] [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...] [-chown [-R] [OWNER][:[GROUP]] PATH...] [-chgrp [-R] GROUP PATH...] [-count[-q] <path>] [-help [cmd]]
  40. HDFS Java API // Get default file system instance fs = FileSystem.get(new Configuration()); // Or get file system instance from URI fs = FileSystem.get(URI.create(uri), new Configuration()); // Create, open, list, … OutputStream out = fs.create(path, …); InputStream in = fs.open(path, …); boolean isDone = fs.delete(path, recursive); FileStatus[] fstat = fs.listStatus(path);
  41. libHDFS #include "hdfs.h" hdfsFS fs = hdfsConnectNewInstance("default", 0); hdfsFile writeFile = hdfsOpenFile(fs, "/tmp/test.txt", O_WRONLY|O_CREAT, 0, 0, 0); tSize num_written = hdfsWrite(fs, writeFile, (void*)buffer, sizeof(buffer)); hdfsCloseFile(fs, writeFile); hdfsFile readFile = hdfsOpenFile(fs, "/tmp/test.txt", O_RDONLY, 0, 0, 0); tSize num_read = hdfsRead(fs, readFile, (void*)buffer, sizeof(buffer)); hdfsCloseFile(fs, readFile); hdfsDisconnect(fs);
  42. Installing Hadoop • Check requirements • Java 1.6+ • bash (Cygwin on Windows) • Download Hadoop release • Change configuration • Launch daemons
  43. Download Hadoop $ wget http://www.apache.org/dist/hadoop/core/hadoop-0.18.3/hadoop-0.18.3.tar.gz $ tar zxvf hadoop-0.18.3.tar.gz $ cd hadoop-0.18.3 $ ls -cF conf commons-logging.properties hadoop-site.xml configuration.xsl log4j.properties hadoop-default.xml masters hadoop-env.sh slaves hadoop-metrics.properties sslinfo.xml.example
  44. Set Environment # Modify conf/hadoop-env.sh $ export JAVA_HOME=.... $ export HADOOP_HOME=.... $ export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves $ export HADOOP_CONF_DIR=${HADOOP_HOME}/conf # Enable password-less ssh # Assuming $HOME is shared across all nodes $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
  45. Make Directories # On Namenode, create metadata storage and tmp space $ mkdir -p /home/hadoop/dfs/name $ mkdir -p /tmp/hadoop # Create "slaves" file $ cat > conf/slaves slave00 slave01 slave02 ... ^D # Create data directories on each slave $ bin/slaves.sh "mkdir -p /tmp/hadoop" $ bin/slaves.sh "mkdir -p /home/hadoop/dfs/data"
  46. Start Daemons # Modify hadoop-site.xml with appropriate # fs.default.name, mapred.job.tracker, etc. $ mv ~/myconf.xml conf/hadoop-site.xml # On Namenode $ bin/hadoop namenode -format # Start all daemons $ bin/start-all.sh # Done!
  47. Check Namenode
  48. Cluster Summary
  49. Browse Filesystem
  50. Browse Filesystem
  51. Browse Filesystem
  52. Questions ?
  53. Hadoop MapReduce
  54. Think MR • Record = (Key, Value) • Key: Comparable, Serializable • Value: Serializable • Input, Map, Shuffle, Reduce, Output
  55. Seems Familiar ? cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist
  56. Map • Input: (Key_1, Value_1) • Output: List(Key_2, Value_2) • Projections, Filtering, Transformation
  57. Shuffle • Input: List(Key_2, Value_2) • Output • Sort(Partition(List(Key_2, List(Value_2)))) • Provided by Hadoop
  58. Reduce • Input: List(Key_2, List(Value_2)) • Output: List(Key_3, Value_3) • Aggregation
  59. Example: Unigrams • Input: Huge text corpus • Wikipedia Articles (40GB uncompressed) • Output: List of words sorted in descending order of frequency
  60. Unigrams $ cat ~/wikipedia.txt | sed -e 's/ /\n/g' | grep . | sort | uniq -c > ~/frequencies.txt $ cat ~/frequencies.txt | sort -n -k1,1 -r > ~/unigrams.txt
  61. MR for Unigrams mapper (filename, file-contents): for each word in file-contents: emit (word, 1) reducer (word, values): sum = 0 for each value in values: sum = sum + value emit (word, sum)
  62. MR for Unigrams mapper (word, frequency): emit (frequency, word) reducer (frequency, words): for each word in words: emit (word, frequency)
  63. Dataflow
  64. MR Dataflow
  65. Unigrams: Java Mapper public static class MapClass extends MapReduceBase implements Mapper <LongWritable, Text, Text, IntWritable> { public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { Text word = new Text(itr.nextToken()); output.collect(word, new IntWritable(1)); } } }
  66. Unigrams: Java Reducer public static class Reduce extends MapReduceBase implements Reducer <Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
  67. Unigrams: Driver public void run(String inputPath, String outputPath) throws Exception { JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(MapClass.class); conf.setReducerClass(Reduce.class); FileInputFormat.addInputPath(conf, new Path(inputPath)); FileOutputFormat.setOutputPath(conf, new Path(outputPath)); JobClient.runJob(conf); }
  68. MapReduce Pipeline
  69. Pipeline Details
  70. Configuration • Unified mechanism for • Configuring Daemons • Runtime environment for Jobs/Tasks • Defaults: *-default.xml • Site-Specific: *-site.xml • final parameters
  71. Example <configuration> <property> <name>mapred.job.tracker</name> <value>head.server.node.com:9001</value> </property> <property> <name>fs.default.name</name> <value>hdfs://head.server.node.com:9000</value> </property> <property> <name>mapred.child.java.opts</name> <value>-Xmx512m</value> <final>true</final> </property> .... </configuration>
  72. InputFormats • TextInputFormat (default): Key = file offset, Value = text line • KeyValueInputFormat: Key = Text (up to \t), Value = remaining Text • SequenceFileInputFormat: Key = user-defined, Value = user-defined
  73. OutputFormats • TextOutputFormat (default): writes Key \t Value \n • SequenceFileOutputFormat: binary serialized keys and values • NullOutputFormat: discards output
  74. Hadoop Streaming • Hadoop is written in Java • Java MapReduce code is “native” • What about non-Java programmers? • Perl, Python, Shell, R • grep, sed, awk, uniq as Mappers/Reducers • Text input and output
  75. Hadoop Streaming • Thin Java wrappers for Map & Reduce Tasks • Forks actual Mapper & Reducer • IPC via stdin, stdout, stderr • Key.toString() \t Value.toString() \n • Slower than Java programs • Allows for quick prototyping / debugging
  76. Hadoop Streaming $ bin/hadoop jar hadoop-streaming.jar -input in-files -output out-dir -mapper mapper.sh -reducer reducer.sh # mapper.sh sed -e 's/ /\n/g' | grep . # reducer.sh uniq -c | awk '{print $2 "\t" $1}'
  77. Hadoop Pipes • Library for C/C++ • Key & Value are std::string (binary) • Communication through Unix pipes • High numerical performance • Legacy C/C++ code (needs modification)
  78. Pipes Program #include "hadoop/Pipes.hh" #include "hadoop/TemplateFactory.hh" #include "hadoop/StringUtils.hh" int main(int argc, char *argv[]) { return HadoopPipes::runTask( HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>()); }
  79. Pipes Mapper class WordCountMap: public HadoopPipes::Mapper { public: WordCountMap(HadoopPipes::TaskContext& context){} void map(HadoopPipes::MapContext& context) { std::vector<std::string> words = HadoopUtils::splitString( context.getInputValue(), " "); for(unsigned int i=0; i < words.size(); ++i) { context.emit(words[i], "1"); } } };
  80. Pipes Reducer class WordCountReduce: public HadoopPipes::Reducer { public: WordCountReduce(HadoopPipes::TaskContext& context){} void reduce(HadoopPipes::ReduceContext& context) { int sum = 0; while (context.nextValue()) { sum += HadoopUtils::toInt(context.getInputValue()); } context.emit(context.getInputKey(), HadoopUtils::toString(sum)); } };
  81. Running Pipes # upload executable to HDFS $ bin/hadoop fs -put wordcount /examples/bin # Specify configuration $ vi /tmp/word.xml ... // Set the binary path on DFS <property> <name>hadoop.pipes.executable</name> <value>/examples/bin/wordcount</value> </property> ... # Execute job $ bin/hadoop pipes -conf /tmp/word.xml -input in-dir -output out-dir
  82. MR Architecture
  83. Job Submission
  84. Initialization
  85. Scheduling
  86. Execution
  87. Map Task
  88. Sort Buffer
  89. Reduce Task
  90. Questions ?
  91. Running Hadoop Jobs
  92. Running a Job [milindb@gateway ~]$ hadoop jar $HADOOP_HOME/hadoop-examples.jar wordcount /data/newsarchive/20080923 /tmp/newsout input.FileInputFormat: Total input paths to process : 4 mapred.JobClient: Running job: job_200904270516_5709 mapred.JobClient: map 0% reduce 0% mapred.JobClient: map 3% reduce 0% mapred.JobClient: map 7% reduce 0% .... mapred.JobClient: map 100% reduce 21% mapred.JobClient: map 100% reduce 31% mapred.JobClient: map 100% reduce 33% mapred.JobClient: map 100% reduce 66% mapred.JobClient: map 100% reduce 100% mapred.JobClient: Job complete: job_200904270516_5709
  93. Running a Job mapred.JobClient: Counters: 18 mapred.JobClient: Job Counters mapred.JobClient: Launched reduce tasks=1 mapred.JobClient: Rack-local map tasks=10 mapred.JobClient: Launched map tasks=25 mapred.JobClient: Data-local map tasks=1 mapred.JobClient: FileSystemCounters mapred.JobClient: FILE_BYTES_READ=491145085 mapred.JobClient: HDFS_BYTES_READ=3068106537 mapred.JobClient: FILE_BYTES_WRITTEN=724733409 mapred.JobClient: HDFS_BYTES_WRITTEN=377464307
  94. Running a Job mapred.JobClient: Map-Reduce Framework mapred.JobClient: Combine output records=73828180 mapred.JobClient: Map input records=36079096 mapred.JobClient: Reduce shuffle bytes=233587524 mapred.JobClient: Spilled Records=78177976 mapred.JobClient: Map output bytes=4278663275 mapred.JobClient: Combine input records=371084796 mapred.JobClient: Map output records=313041519 mapred.JobClient: Reduce input records=15784903
  95. JobTracker WebUI
  96. JobTracker Status
  97. Jobs Status
  98. Job Details
  99. Job Counters
  100. Job Progress
  101. All Tasks
  102. Task Details
  103. Task Counters
  104. Task Logs
  105. Debugging • Run job with the Local Runner • Set mapred.job.tracker to "local" • Runs application in a single thread • Run job on a small data set on a 1 node cluster
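  An aside (not in the original deck): the Local Runner switch above, as a minimal JobConf sketch; the fs.default.name line is an added assumption so input/output paths resolve on the local filesystem.
      JobConf conf = new JobConf(WordCount.class);
      conf.set("mapred.job.tracker", "local"); // Local Runner: runs the job in a single thread
      conf.set("fs.default.name", "file:///"); // assumption: read input from the local filesystem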
  106. Debugging • Set keep.failed.task.files to keep files from failed tasks • Use the IsolationRunner to run just the failed task • Java debugging hints • Send a kill -QUIT to the Java process to get the call stack, locks held, deadlocks
  107. Hadoop Performance Tuning
  108. Example • “Bob” wants to count records in AdServer logs (several hundred GB) • Used Identity Mapper & a single counting reducer • What is he doing wrong? • This happened, really!
  109. MapReduce Performance • Reduce intermediate data size • map outputs + reduce inputs • Maximize map input transfer rate • Pipelined writes from reduce • Opportunity to load balance
  110. Shuffle • Often the most expensive component • M * R transfers over the network • Sort map outputs (intermediate data) • Merge reduce inputs
  111. Improving Shuffle • Avoid shuffling/sorting if possible • Minimize redundant transfers • Compress intermediate data
  112. Avoid Shuffle • Set mapred.reduce.tasks to zero • Known as map-only computations • Filters, Projections, Transformations • Number of output files = number of input splits = number of input blocks • May overwhelm namenode
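  An aside, a minimal sketch of the map-only setup above in a driver; MyFilter is an illustrative class name.
      JobConf conf = new JobConf(MyFilter.class); // illustrative driver class
      conf.setNumReduceTasks(0); // same effect as mapred.reduce.tasks=0: map outputs go straight to HDFS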
  113. Minimize Redundant Transfers • Combiners • Intermediate data compression
  114. Combiners • When Maps produce many repeated keys • Combiner: local aggregation after Map & before Reduce • Side-effect free • Same interface as Reducers, and often the same class
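  An aside: since a Combiner has the same interface as a Reducer, the Unigrams job (slide 67) can reuse its Reduce class as the combiner; a one-line driver addition, sketched here.
      conf.setCombinerClass(Reduce.class); // local aggregation after each Map, before the shuffle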
  115. Compression • Often yields huge performance gains • Set mapred.output.compress to true to compress job output • Set mapred.compress.map.output to true to compress map outputs • Codecs: Java zlib (default), LZO, bzip2, native gzip
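  An aside, a sketch of the two settings named on the slide via the generic setters on JobConf; codec selection is left at the defaults.
      conf.setBoolean("mapred.output.compress", true);     // compress final job output
      conf.setBoolean("mapred.compress.map.output", true); // compress intermediate map outputs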
  116. Load Imbalance • Inherent in application • Imbalance in input splits • Imbalance in computations • Imbalance in partitions • Heterogeneous hardware • Degradation over time
  117. Optimal Number of Nodes • T_m = Map slots per TaskTracker • N = optimal number of nodes • S_m = N * T_m = Total Map slots in cluster • M = Map tasks in application • Rule of thumb: 5 * S_m < M < 10 * S_m
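  A worked instance of the rule of thumb, with illustrative numbers: if T_m = 4 map slots per TaskTracker and the application has M = 8,000 map tasks, then 5 * S_m < M < 10 * S_m gives 800 < S_m < 1,600 total map slots, i.e. roughly N = 200 to 400 nodes.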
  118. Configuring Task Slots • mapred.tasktracker.map.tasks.maximum • mapred.tasktracker.reduce.tasks.maximum • Tradeoffs: Number of cores, RAM, number and size of disks • Also consider resources consumed by TaskTracker & DataNode
  119. Speculative Execution • Runs multiple instances of slow tasks • Instance that finishes first succeeds • mapred.map.tasks.speculative.execution=true • mapred.reduce.tasks.speculative.execution=true • Can dramatically bring in long tails on jobs
  120. Hadoop Examples
  121. Example: Standard Deviation • Takeaway: Changing the algorithm to suit the architecture yields the best implementation
  122. Implementation 1 • Two Map-Reduce stages • First stage computes Mean • Second stage computes standard deviation
  123. Stage 1: Compute Mean • Map Input (x_i for i = 1..N_m) • Map Output (N_m, Mean(x_1..N_m)) • Single Reducer • Reduce Input (Group(Map Output)) • Reduce Output (Mean(x_1..N))
  124. Stage 2: Compute Standard Deviation • Map Input (x_i for i = 1..N_m) & Mean(x_1..N) • Map Output (Sum((x_i - Mean(x))^2 for i = 1..N_m)) • Single Reducer • Reduce Input (Group(Map Output)) & N • Reduce Output (σ)
  125. Standard Deviation • Algebraically equivalent • Be careful about numerical accuracy, though
  126. Implementation 2 • Map Input (x_i for i = 1..N_m) • Map Output (N_m, [Sum(x_i^2 for i = 1..N_m), Mean(x_1..N_m)]) • Single Reducer • Reduce Input (Group(Map Output)) • Reduce Output (σ)
  127. NGrams
  128. Bigrams • Input: A large text corpus • Output: List(word_1, Top_K(word_2)) • Two Stages: • Generate all possible bigrams • Find most frequent K bigrams for each word
  129. Bigrams: Stage 1 Map • Generate all possible Bigrams • Map Input: Large text corpus • Map computation • In each sentence, for each "word_1 word_2" • Output (word_1, word_2), (word_2, word_1) • Partition & Sort by (word_1, word_2)
  130. pairs.pl while(<STDIN>) { chomp; $_ =~ s/[^a-zA-Z]+/ /g; $_ =~ s/^\s+//g; $_ =~ s/\s+$//g; $_ =~ tr/A-Z/a-z/; my @words = split(/\s+/, $_); for (my $i = 0; $i < $#words; ++$i) { print "$words[$i]:$words[$i+1]\n"; print "$words[$i+1]:$words[$i]\n"; } }
  131. Bigrams: Stage 1 Reduce • Input: List(word_1, word_2) sorted and partitioned • Output: List(word_1, [freq, word_2]) • Counting similar to Unigrams example
  132. count.pl $_ = <STDIN>; chomp; my ($pw1, $pw2) = split(/:/, $_); $count = 1; while(<STDIN>) { chomp; my ($w1, $w2) = split(/:/, $_); if ($w1 eq $pw1 && $w2 eq $pw2) { $count++; } else { print "$pw1:$count:$pw2\n"; $pw1 = $w1; $pw2 = $w2; $count = 1; } } print "$pw1:$count:$pw2\n";
  133. Bigrams: Stage 2 Map • Input: List(word_1, [freq, word_2]) • Output: List(word_1, [freq, word_2]) • Identity Mapper (/bin/cat) • Partition by word_1 • Sort descending by (word_1, freq)
  134. Bigrams: Stage 2 Reduce • Input: List(word_1, [freq, word_2]) • partitioned by word_1 • sorted descending by (word_1, freq) • Output: Top_K(List(word_1, [freq, word_2])) • For each word, throw away after K records
  135. firstN.pl $N = 5; $_ = <STDIN>; chomp; my ($pw1, $count, $pw2) = split(/:/, $_); $idx = 1; $out = "$pw1\t$pw2,$count;"; while(<STDIN>) { chomp; my ($w1, $c, $w2) = split(/:/, $_); if ($w1 eq $pw1) { if ($idx < $N) { $out .= "$w2,$c;"; $idx++; } } else { print "$out\n"; $pw1 = $w1; $idx = 1; $out = "$pw1\t$w2,$c;"; } } print "$out\n";
  136. Partitioner • By default, evenly distributes keys • hashcode(key) % NumReducers • Overriding the partitioner • Skew in map-outputs • Restrictions on reduce outputs • All URLs in a domain together
  137. Partitioner // JobConf.setPartitionerClass(className) public interface Partitioner <K, V> extends JobConfigurable { int getPartition(K key, V value, int maxPartitions); }
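  An aside (not from the deck): a hedged sketch of the "all URLs in a domain together" override from slide 136, written against the old-API interface above; the class name and the crude domain extraction are illustrative only.
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.Partitioner;

      public class DomainPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) {} // no setup needed
        public int getPartition(Text key, Text value, int numPartitions) {
          String url = key.toString();
          int start = url.indexOf("://");
          int from = (start < 0) ? 0 : start + 3;
          int end = url.indexOf('/', from);
          String domain = (end < 0) ? url.substring(from) : url.substring(from, end);
          // mask the sign bit so the partition index is non-negative
          return (domain.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }
  It would be registered with conf.setPartitionerClass(DomainPartitioner.class);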
  138. Fully Sorted Output • By contract, reducer gets input sorted on key • Typically reducer output order is the same as input order • Each output file (part file) is sorted • How to make sure that keys in part i are all less than keys in part i+1 ?
  139. Fully Sorted Output • Use single reducer for small output • Insight: Reducer input must be fully sorted • Partitioner should provide fully sorted reduce input • Sampling + Histogram equalization
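  An aside: one way to make the partitioner provide fully sorted reduce input is a range partitioner driven by sampled split points. A minimal sketch, with hard-coded boundaries standing in for the sampling / histogram-equalization step; it assumes the job runs with splitPoints.length + 1 reducers.
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.JobConf;
      import org.apache.hadoop.mapred.Partitioner;

      public class RangePartitioner implements Partitioner<Text, IntWritable> {
        private Text[] splitPoints;
        public void configure(JobConf job) {
          // assumption: boundaries computed offline by sampling the key distribution
          splitPoints = new Text[] { new Text("g"), new Text("n"), new Text("t") };
        }
        public int getPartition(Text key, IntWritable value, int numPartitions) {
          for (int i = 0; i < splitPoints.length; i++) {
            if (key.compareTo(splitPoints[i]) < 0) {
              return i; // keys below the i-th boundary go to part i
            }
          }
          return splitPoints.length; // everything else goes to the last part
        }
      }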
  140. Number of Maps • Number of Input Splits • Number of HDFS blocks • mapred.map.tasks • Minimum Split Size (mapred.min.split.size) • split_size = max(min(hdfs_block_size, data_size/#maps), min_split_size)
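  A worked instance of the formula, with illustrative numbers: for 10 GB of input, 64 MB HDFS blocks, mapred.map.tasks = 100, and the default minimum split size, data_size/#maps is about 100 MB, so split_size = max(min(64 MB, 100 MB), min_split_size) = 64 MB, giving roughly 160 maps.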
  141. Parameter Sweeps • External program processes data based on command-line parameters • ./prog -params="0.1,0.3" < in.dat > out.dat • Objective: Run an instance of ./prog for each parameter combination • Number of Mappers = Number of different parameter combinations
  142. Parameter Sweeps • Input File: params.txt • Each line contains one combination of parameters • Input format is NLineInputFormat (N=1) • Number of maps = Number of splits = Number of lines in params.txt
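  An aside, a hedged sketch of the driver-side setup for such a sweep with the old-API NLineInputFormat; the property name is from the 0.18-0.20 era and should be treated as an assumption for other versions.
      conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
      conf.setInt("mapred.line.input.format.linespermap", 1); // one parameter combination per map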
  143. Auxiliary Files • -file auxFile.dat • Job submitter adds file to job.jar • Unjarred on the task tracker • Available to task as $cwd/auxFile.dat • Not suitable for large / frequently used files
  144. Auxiliary Files • Tasks need to access “side” files • Read-only dictionaries (such as for porn filtering) • Dynamically linked libraries • Tasks themselves can fetch files from HDFS • Not always! (Hint: Unresolved symbols)
  145. Distributed Cache • Specify “side” files via -cacheFile • If lots of such files are needed • Create a tar.gz archive • Upload to HDFS • Specify via -cacheArchive
  146. Distributed Cache • TaskTracker downloads these files “once” • Untars archives • Accessible in task’s $cwd before task starts • Cached across multiple tasks • Cleaned up upon exit
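  An aside: the Java equivalent of -cacheFile / -cacheArchive via the DistributedCache API, in the fragment style of the HDFS Java API slide; paths are illustrative.
      import java.net.URI;
      import org.apache.hadoop.filecache.DistributedCache;

      // a single read-only side file
      DistributedCache.addCacheFile(new URI("/user/milindb/dict.dat"), conf);
      // an archive, unpacked on the TaskTracker before the task starts
      DistributedCache.addCacheArchive(new URI("/user/milindb/side.tar.gz"), conf);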
  147. Joining Multiple Datasets • Datasets are streams of key-value pairs • Could be split across multiple files in a single directory • Join could be on Key, or any field in Value • Join could be inner, outer, left outer, cross product, etc. • Join is a natural Reduce operation
  148. Example • A = (id, name), B = (name, address) • A is in /path/to/A/part-* • B is in /path/to/B/part-* • Select A.name, B.address where A.name == B.name
  149. Map in Join • Input: (Key_1, Value_1) from A or B • map.input.file indicates A or B • MAP_INPUT_FILE in Streaming • Output: (Key_2, [Value_2, A|B]) • Key_2 is the Join Key
  150. Reduce in Join • Input: Groups of [Value_2, A|B] for each Key_2 • Operation depends on which kind of join • Inner join checks if key has values from both A & B • Output: (Key_2, JoinFunction(Value_2, …))
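  An aside (not from the deck): a hedged sketch of the Map side of this reduce-side join for the A/B example of slide 148, tagging each record with its source via map.input.file; the tab-separated field layout and the class name are assumptions.
      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      public static class JoinMap extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        private String tag; // "A" or "B", decided once per input split
        public void configure(JobConf job) {
          tag = job.get("map.input.file").contains("/path/to/A/") ? "A" : "B";
        }
        public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          String[] f = value.toString().split("\t");
          // A = (id, name): join key is the second field;
          // B = (name, address): join key is the first field
          String joinKey = tag.equals("A") ? f[1] : f[0];
          String rest = tag.equals("A") ? f[0] : f[1];
          output.collect(new Text(joinKey), new Text(tag + ":" + rest));
        }
      }
  The reducer then splits each key's values into A-tagged and B-tagged lists and, for an inner join, emits (name, address) only when both lists are non-empty.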
  151. MR Join Performance • Map Input = Total of A & B • Map output = Total of A & B • Shuffle & Sort • Reduce input = Total of A & B • Reduce output = Size of joined dataset • Filter and Project in Map
  152. Join Special Cases • Fragment-Replicate • 100GB dataset with 100 MB dataset • Equipartitioned Datasets • Identically keyed • Equal number of partitions • Each partition locally sorted
  153. Fragment-Replicate • Fragment larger dataset • Specify as Map input • Replicate smaller dataset • Use Distributed Cache • Map-Only computation • No shuffle / sort
  154. Equipartitioned Join • Available since Hadoop 0.16 • Datasets joined “before” input to mappers • Input format: CompositeInputFormat • mapred.join.expr • Simpler to use in Java, but can be used in Streaming
  155. Example mapred.join.expr = inner ( tbl ( ....SequenceFileInputFormat.class, "hdfs://namenode:8020/path/to/data/A" ), tbl ( ....SequenceFileInputFormat.class, "hdfs://namenode:8020/path/to/data/B" ) )
  156. Questions ?
  157. Apache Pig
  158. What is Pig? • System for processing large semi-structured data sets using Hadoop MapReduce platform • Pig Latin: High-level procedural language • Pig Engine: Parser, Optimizer and distributed query execution
  159. Pig vs SQL • Pig is procedural / SQL is declarative • Nested relational data model / Flat relational data model • Schema is optional / Schema is required • Scan-centric analytic workloads / OLTP + OLAP workloads • Limited query optimization / Significant opportunity for query optimization
  160. Pig vs Hadoop • Increases programmer productivity • Decreases duplication of effort • Insulates against Hadoop complexity • Version Upgrades • JobConf configuration tuning • Job Chains
  161. Example • Input: User profiles, Page visits • Find the top 5 most visited pages by users aged 18-25
  162. In Native Hadoop
  163. In Pig Users = load 'users' as (name, age); Filtered = filter Users by age >= 18 and age <= 25; Pages = load 'pages' as (user, url); Joined = join Filtered by name, Pages by user; Grouped = group Joined by url; Summed = foreach Grouped generate group, COUNT(Joined) as clicks; Sorted = order Summed by clicks desc; Top5 = limit Sorted 5; store Top5 into 'top5sites';
  164. Natural Fit
  165. Comparison
  166. Flexibility & Control • Easy to plug in user code • Metadata is not mandatory • Does not impose a data model • Fine grained control • Complex data types
  167. Pig Data Types • Tuple: Ordered set of fields • Field can be simple or complex type • Nested relational model • Bag: Collection of tuples • Can contain duplicates • Map: Set of (key, value) pairs
  168. Simple data types • int : 42 • long : 42L • float : 3.1415f • double : 2.7182818 • chararray : UTF-8 String • bytearray : blob
  169. Expressions A = LOAD 'data.txt' AS (f1:int, f2:{t:(n1:int, n2:int)}, f3: map[]) A = { ( 1, -- A.f1 or A.$0 { (2, 3), (4, 6) }, -- A.f2 or A.$1 [ 'yahoo'#'mail' ] -- A.f3 or A.$2 ) }
  170. Pig Unigrams • Input: Large text document • Process: • Load the file • For each line, generate word tokens • Group by word • Count words in each group
  171. Load myinput = load '/user/milindb/text.txt' USING TextLoader() as (myword:chararray); { (program program) (pig pig) (program pig) (hadoop pig) (latin latin) (pig latin) }
  172. Tokenize words = FOREACH myinput GENERATE FLATTEN(TOKENIZE(*)); { (program) (program) (pig) (pig) (program) (pig) (hadoop) (pig) (latin) (latin) (pig) (latin) }
  173. Group grouped = GROUP words BY $0; { (pig, {(pig), (pig), (pig), (pig), (pig)}) (latin, {(latin), (latin), (latin)}) (hadoop, {(hadoop)}) (program, {(program), (program), (program)}) }
  174. Count counts = FOREACH grouped GENERATE group, COUNT(words); { (pig, 5L) (latin, 3L) (hadoop, 1L) (program, 3L) }
  175. Store store counts into '/user/milindb/output' using PigStorage(); pig 5 latin 3 hadoop 1 program 3
  176. Example: Log Processing -- use a custom loader Logs = load '/var/log/access_log' using CommonLogLoader() as (addr, logname, user, time, method, uri, p, bytes); -- apply your own function Cleaned = foreach Logs generate addr, canonicalize(uri) as url; Grouped = group Cleaned by url; -- run the result through a binary Analyzed = stream Grouped through 'urlanalyzer.py'; store Analyzed into 'analyzedurls';
  177. Schema on the fly -- declare your types Grades = load 'studentgrades' as (name: chararray, age: int, gpa: double); Good = filter Grades by age > 18 and gpa > 3.0; -- ordering will be by type Sorted = order Good by gpa; store Sorted into 'smartgrownups';
  178. Nested Data Logs = load 'weblogs' as (url, userid); Grouped = group Logs by url; -- Code inside {} will be applied to each -- value in turn. DistinctCount = foreach Grouped { Userid = Logs.userid; DistinctUsers = distinct Userid; generate group, COUNT(DistinctUsers); } store DistinctCount into 'distinctcount';
  179. Pig Architecture
  180. Pig Stages
  181. Logical Plan • Directed Acyclic Graph • Logical Operator as Node • Data flow as edges • Logical Operators • One per Pig statement • Type checking with Schema
  182. Pig Statements • Load: Read data from the file system • Store: Write data to the file system • Dump: Write data to stdout
  183. Pig Statements • Foreach..Generate: Apply expression to each record and generate one or more records • Filter: Apply predicate to each record and remove records where false • Stream..through: Stream records through user-provided binary
  184. Pig Statements • Group/CoGroup: Collect records with the same key from one or more inputs • Join: Join two or more inputs based on a key • Order..by: Sort records based on a key
  185. Physical Plan • Pig supports two back-ends • Local • Hadoop MapReduce • 1:1 correspondence with most logical operators • Except Distinct, Group, Cogroup, Join, etc.
  186. MapReduce Plan • Detect Map-Reduce boundaries • Group, Cogroup, Order, Distinct • Coalesce operators into Map and Reduce stages • Job.jar is created and submitted to Hadoop JobControl
  187. Lazy Execution • Nothing really executes until you request output • Store, Dump, Explain, Describe, Illustrate • Advantages • In-memory pipelining • Filter re-ordering across multiple commands
  188. Parallelism • Split-wise parallelism on Map-side operators • By default, 1 reducer • PARALLEL keyword • group, cogroup, cross, join, distinct, order
  189. Running Pig $ pig grunt > A = load 'students' as (name, age, gpa); grunt > B = filter A by gpa > '3.5'; grunt > store B into 'good_students'; grunt > dump A; (jessica thompson, 73, 1.63) (victor zipper, 23, 2.43) (rachel hernandez, 40, 3.60) grunt > describe A; A: (name, age, gpa)
  190. Running Pig • Batch mode • $ pig myscript.pig • Local mode • $ pig -x local • Java mode (embed Pig statements in Java) • Keep pig.jar in the class path
  191. PigPen
  192. PigPen
  193. Pig for SQL Programmers
  194. SQL to Pig • SQL: ...FROM MyTable... → Pig: A = LOAD 'MyTable' USING PigStorage('\t') AS (col1:int, col2:int, col3:int); • SQL: SELECT col1 + col2, col3 ... → Pig: B = FOREACH A GENERATE col1 + col2, col3; • SQL: ...WHERE col2 > 2 → Pig: C = FILTER B BY col2 > 2;
  195. SQL to Pig • SQL: SELECT col1, col2, sum(col3) FROM X GROUP BY col1, col2 → Pig: D = GROUP A BY (col1, col2); E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3); • SQL: ...HAVING sum(col3) > 5 → Pig: F = FILTER E BY $2 > 5; • SQL: ...ORDER BY col1 → Pig: G = ORDER F BY $0;
  196. SQL to Pig • SQL: SELECT DISTINCT col1 FROM X → Pig: I = FOREACH A GENERATE col1; J = DISTINCT I; • SQL: SELECT col1, count(DISTINCT col2) FROM X GROUP BY col1 → Pig: K = GROUP A BY col1; L = FOREACH K { M = DISTINCT A.col2; GENERATE FLATTEN(group), count(M); }
  197. SQL to Pig • SQL: SELECT A.col1, B.col3 FROM A JOIN B USING (col1) → Pig: N = JOIN A BY col1 INNER, B BY col1 INNER; O = FOREACH N GENERATE A.col1, B.col3; -- Or: N = COGROUP A BY col1 INNER, B BY col1 INNER; O = FOREACH N GENERATE FLATTEN(A), FLATTEN(B); P = FOREACH O GENERATE A.col1, B.col3;
  198. Questions ?