The analytics stack
  Hadoop & Pig
Outline of the presentation

 Hadoop
  Motivations. What is it? And high-level concepts
  The Ecosystem. The MapReduce model & framework
   and HDFS
  Programming with Hadoop
 Pig
  What is it? Motivations
  Model & components
 Integration with Cassandra
Please interrupt and ask questions!




Traditional HPC systems

 CPU-intensive computations
   Relatively small amount of data
   Tightly-coupled applications
   Highly concurrent I/O requirements
   Complex message passing paradigms such as MPI,
    PVM…
   Developers might need to spend some time
    designing for failure
Challenges

 Data and storage
   Locality, computation close to the data

 In large-scale systems, nodes fail
   Mean time between failures: about 3 years for 1 node, but about 1 day for 1000 nodes
   Built-in fault-tolerance

 Distributed programming is complex
   Need a simple data-parallel programming model. Users would
     structure the application in high-level functions, the system
     distributes the data & jobs and handles communications and faults
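
The failure figures on this slide follow from simple arithmetic. A minimal sketch, assuming nodes fail independently and at a constant rate (an idealisation; the helper name `cluster_mtbf_days` is hypothetical):

```python
# Expected time between failures somewhere in a cluster, assuming
# independent nodes with a constant failure rate (an idealisation).
def cluster_mtbf_days(node_mtbf_days: float, nodes: int) -> float:
    return node_mtbf_days / nodes

node_mtbf = 3 * 365  # one node: about one failure every 3 years
for n in (1, 100, 1000):
    days = cluster_mtbf_days(node_mtbf, n)
    print(f"{n} nodes: one failure every {days:.1f} days")
```

With 1000 nodes the expected gap between failures drops to roughly one day, which is why fault-tolerance has to be built into the system.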

What requirements

 A simple data-parallel programming model, designed for
  high scalability and resiliency
      Scalability to large-scale data volumes
      Automated fault-tolerance at application level rather
       than relying on high-availability hardware
      Simplified I/O and tasks monitoring
      All based on cost-efficient commodity machines (cheap,
       but unreliable), and commodity network


Hadoop’s core concepts

 Data spread in advance, persistent (in terms of
  locality), and replicated
 No inter-dependencies / shared nothing architecture
 Applications written in two pieces of code
   And developers do not have to worry about the
    underlying issues in networking, jobs interdependencies,
    scheduling, etc…


Where does it come from?

 Hadoop originated from Apache Nutch, an open source
  web search engine
 After the publications of the GFS and MapReduce papers,
  in 2003 & 2004, the Nutch developers decided to
  implement open source versions
 In February 2006, it became Hadoop, with a dedicated
  team at Yahoo!
 September 2007 - release 0.14.1
 Last release 1.0.3 out last week
 Used by a large number of companies including Facebook,
  LinkedIn, Twitter, and Hulu, among many others
The model

 A map function processes a key/value pair to generate a set of
  intermediate key/value pairs
   Divides the problem into smaller ‘intermediate key/value’ pairs
 The reduce function merges all intermediate values associated with
  the same intermediate key
 Run-time system takes care of:
   Partitioning the input data across nodes (blocks/chunks typically of
    64 MB to 128 MB)
   Scheduling the data and execution. Maps operate on a single block.
   Managing node failures, replication, re-submissions

Simple Word Count
# key: offset, value: line
# output() stands for the framework's emit call
def mapper(key, value):
    for word in value.split():
        output(word, 1)

# key: a word, values: iterator over counts
def reducer(key, values):
    output(key, sum(values))
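
To see the whole flow end to end, the word count can be simulated on one machine. This is only a sketch of the model, not of Hadoop itself: the `run_job` helper is hypothetical, the "offset" is a line index rather than a byte offset, and a dictionary stands in for the shuffle that groups intermediate values by key.

```python
from collections import defaultdict

def mapper(offset, line):
    # key: offset, value: line -> emit (word, 1) for every word
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # key: a word, value: iterator over counts -> emit (word, total)
    yield word, sum(counts)

def run_job(lines):
    # Map phase plus shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in mapper(offset, line):
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key, in sorted key order.
    result = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            result[out_key] = out_value
    return result

print(run_job(["to be or not to be", "to do"]))
```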


The Combiner

 A combiner is a local aggregation function for repeated keys
  produced by the map
 Works for commutative and associative functions like sum, count, max

 Decreases the size of intermediate data / communications

 Map-side aggregation for word count:
         def combiner(key, values):
                 output(key, sum(values))
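
A small sketch of the effect on intermediate data, assuming a single map task that processes one short block of text (the function names and the sample text are illustrative only):

```python
from collections import Counter

def map_words(line):
    # Plain map output: one (word, 1) pair per occurrence.
    return [(word, 1) for word in line.split()]

def combine(pairs):
    # Map-side aggregation: sum the counts of repeated keys locally.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

block = "the cat and the dog and the bird"
raw = map_words(block)
combined = combine(raw)
print(len(raw), "pairs without combiner,", len(combined), "with combiner")
```

A mean cannot be combined this way directly, since a mean of means is not the overall mean; emitting (sum, count) pairs and dividing in the reducer works instead.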


Some other basic examples…
 Distributed Grep:
   Map function emits a line if it matches a supplied pattern
   Reduce function is an identity function that copies the supplied intermediate
    data to the output
 Count of URL accesses:
   Map function processes logs of web page requests and outputs <URL, 1>
   Reduce function adds together all values for the same URL, emitting <URL, total
    count> pairs
 Reverse Web-Link graph:
   Map function outputs <tgt, src> for each link to a tgt in a page named src
   Reduce concatenates the list of all src URLs associated with a given tgt URL and
    emits the pair: <tgt, list(src)>
 Inverted Index:
   Map function parses each document, emitting a sequence of <word, doc_ID>
   Reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair
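
As an illustration of the last example, here is a single-machine sketch of the inverted index. The function names are hypothetical, a dictionary stands in for the shuffle, and the choice to emit each word once per document (rather than once per occurrence) is an assumption made for brevity.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    # Emit <word, doc_ID> for each distinct word in the document.
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_word(word, doc_ids):
    # Emit <word, list(doc_ID)>, sorted for a stable output.
    return word, sorted(doc_ids)

def inverted_index(docs):
    postings = defaultdict(list)  # stands in for the shuffle
    for doc_id, text in docs.items():
        for word, d in map_doc(doc_id, text):
            postings[word].append(d)
    return dict(reduce_word(w, ids) for w, ids in postings.items())

docs = {"d1": "hadoop stores data", "d2": "pig runs on hadoop"}
index = inverted_index(docs)
print(index["hadoop"])
```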

Hadoop Ecosystem




 [Figure: Hadoop ecosystem map, with the core components highlighted; from http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/]
Hadoop components

 Hadoop consists of two core components
   The MapReduce framework, and
   The Hadoop Distributed File System

 MapReduce layer
   JobTracker
   TaskTrackers
 HDFS layer
   Namenode
   Secondary namenode
   Datanode
 [Figure: example of a typical physical distribution within a Hadoop cluster]
HDFS

 Scalable and fault-tolerant. Based on Google’s GFS
 Single namenode stores metadata (file names, block locations, etc.)
 Files split into chunks, replicated across several datanodes
  (typically 3+). It is rack-aware
 Optimised for large files and sequential streaming reads,
  rather than random access
 Files written once, no append

 [Figure: the namenode records File1 as blocks 1 to 4; each block is stored on three of the four datanodes]
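
A toy sketch of chunking and replication, to make the numbers concrete. It assumes a 64 MB block size, a replication factor of 3, and a simple round-robin placement; real HDFS placement is rack-aware and differs from this, and the function and node names are hypothetical.

```python
import math

def place_blocks(file_size_mb, datanodes, block_size_mb=64, replication=3):
    # Number of fixed-size blocks needed to hold the file.
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(n_blocks):
        # Round-robin over datanodes: replicas of a block land on distinct
        # nodes as long as replication <= len(datanodes).
        placement[block] = [datanodes[(block + r) % len(datanodes)]
                            for r in range(replication)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
for block, replicas in place_blocks(200, nodes).items():
    print(block, replicas)
```

Losing any single datanode in this layout still leaves two copies of every block, which is the property the replication is meant to provide.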
HDFS

  HDFS API / HDFS FS Shell for command line*
    > hadoop fs -copyFromLocal local_dir hdfs_dir
    > hadoop fs -copyToLocal hdfs_dir local_dir


  Tools
     Flume: collects, aggregates and moves log data from application
      servers to HDFS
     Sqoop: HDFS import and export to SQL

*http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
MapReduce execution

 In Hadoop, a Job (full program) is a set of tasks
 Each task (mapper or reducer) is attempted at least once, or
  multiple times if it crashes. Multiple attempts may also run
  in parallel (speculative execution)
 Each task runs inside a separate JVM on the tasktracker

 All the class files are assembled into a jar file, which will be
  uploaded into HDFS, before notifying the tasktracker


MapReduce execution

 [Figure: a MapReduce job. The master coordinates the workers. The input files are divided into splits (Split 0 to Split 4); map workers read the splits and write intermediate files locally; reduce workers read those files remotely and write the output files]
Getting Started…

 Multiple choices - Vanilla Apache version, or one of the
  numerous existing distros
     hadoop.apache.org
     www.cloudera.com [A set of VMs is also provided]
     http://www.karmasphere.com/
     …

 Three ways to write jobs in Hadoop:
   Java API
   Hadoop Streaming (for Python, Perl, etc.)
   Pipes API (C++)

Word Count in Java
public static void main(String[] args) throws Exception {
   JobConf conf = new JobConf(WordCount.class);
   conf.setJobName("wordcount");

    conf.setMapperClass(MapClass.class);
    conf.setCombinerClass(ReduceClass.class);
    conf.setReducerClass(ReduceClass.class);

    FileInputFormat.setInputPaths(conf, args[0]);
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
}
Word Count in Java – mapper

public class MapClass extends MapReduceBase
   implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        out.collect(new Text(itr.nextToken()), ONE);
      }
    }
}
Word Count in Java – reducer


public class ReduceClass extends MapReduceBase
   implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
}
Getting keys and values


 [Figure: the input format divides the input file into input splits; a RecordReader turns each split into key/value records for a Mapper. On the output side, each Reducer writes through a RecordWriter, provided by the output format, to its own output file]
Hadoop Streaming
 Mapper.py:
   #!/usr/bin/env python
   import sys

   # Emit "word<TAB>1" for every word read from standard input
   for line in sys.stdin:
       for word in line.split():
           print("%s\t%s" % (word, 1))

 Reducer.py:
   #!/usr/bin/env python
   import sys

   # Sum the counts per word, ignoring case
   counts = {}
   for line in sys.stdin:
       word, count = line.rstrip("\n").split("\t", 1)
       word = word.lower()
       counts[word] = counts.get(word, 0) + int(count)
   for word, count in counts.items():
       print("%s\t%s" % (word, count))

You can locally test your code on the command line:
          $> cat data | python Mapper.py | sort | python Reducer.py
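
Because the pipeline sorts the mapper output, all lines for a given word reach the reducer consecutively. A hedged alternative to the dictionary-based reducer, assuming tab-separated `word<TAB>count` lines that are already sorted by word (the function name `reduce_sorted` is hypothetical), keeps only one running total in memory:

```python
from itertools import groupby

def reduce_sorted(lines):
    # Parse "word<TAB>count" lines; input must already be sorted by word.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)

# Sample sorted mapper output; a real streaming reducer would pass sys.stdin.
mapper_output = ["be\t1", "be\t1", "not\t1", "to\t1", "to\t1"]
for word, total in reduce_sorted(mapper_output):
    print("%s\t%d" % (word, total))
```

This matters when the number of distinct keys is too large to hold in a dictionary.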
High-level tools

 MapReduce is fairly low-level: must think about
  keys, values, partitioning, etc.
 How to express parallel algorithms by a series of
  MapReduce jobs
   Can be hard to capture common job building blocks

 Different use cases require different tools!

Pig

 Apache Pig is a platform that raises the level of abstraction for
  processing large datasets. Its language, Pig Latin, is a simple
  query algebra for expressing data transformations and applying
  functions to records


 [Figure: Pig job submission → MapReduce jobs → Hadoop / HDFS]



 Started at Yahoo! Research, >60% of Hadoop jobs within
  Yahoo! are Pig jobs
Motivations

 MapReduce requires a Java programmer
   Pig abstracts it behind a scripting-style language that
    users are already familiar with

 Other than very trivial applications, MapReduce requires
  multiple stages, leading to long development cycles
   Rapid prototyping. Increased productivity

 In MapReduce users have to reinvent common functionality
  (join, filter, etc.)
   Pig provides them
Used for

   Rapid prototyping of algorithms for processing large datasets
   Log analysis
   Ad hoc queries across various large datasets
   Analytics (including through sampling)

 Pig Mix provides a set of performance and scalability
  benchmarks. Currently Pig takes about 1.1 times as long as
  hand-written MapReduce.


Using Pig

 Grunt, the Pig shell

 Executing scripts directly

 Embedding Pig in Java (using PigServer, similar to SQL
  using JDBC), or Python

 A range of tools including Eclipse plug-ins
   PigPen, Pig Editor…


Execution modes

 Pig has two execution types or modes: local mode and
  Hadoop mode

 Local
   Pig runs in a single JVM and accesses the local filesystem.
    Starting from v0.7 it uses the Hadoop job runner.
 Hadoop mode
   Pig runs on a Hadoop cluster (you need to tell Pig about the
    version and point it to your Namenode and Jobtracker)


Running Pig

 Pig resides on the user’s machine and can be independent
  from the Hadoop cluster
 Pig is written in Java and is portable
   Compiles scripts into MapReduce jobs and submits them to the cluster
 No need to install anything extra on the cluster




 [Figure: Pig client]
How does it work

  Pig defines a DAG. A step-by-step set of operations, each
   performing a transformation
  Pig defines a logical plan for these transformations:

A = LOAD 'file' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';

 • Parses, checks, & optimises
 • Plans the execution
     • Maps & Reduces
 • Passes the jar to Hadoop
 • Monitors the progress


Data types & expressions

 Scalar type:
   Int, Long, Float, Double, Chararray, Bytearray
 Complex type representing nested structures:
   Tuple: sequence of fields of any type
   Bag: an unordered collection of tuples
   Map: a set of key-value pairs. Keys must be atoms, values may
    be any type
 Expressions:
   used in Pig as a part of a statement; field name, position ($),
    arithmetic, conditional, comparison, Boolean, etc.
Functions

 Load / Store
   Data loaders; PigStorage, BinStorage, BinaryStorage,
    TextLoader, PigDump
 Evaluation
   Many built-in functions MAX, COUNT, SUM, DIFF, SIZE…
 Filter
   A special type of eval function used by the FILTER operator.
    IsEmpty is a built-in function
 Comparison
   Function used in ORDER statement; ASC | DESC

Schemas

 Schemas enable you to associate names and types with the
  fields in the relation
 Schemas are optional but recommended whenever
  possible; type declarations result in better parse-time error
  checking and more efficient code execution

 They are defined using the AS keyword with operators
   Schema definition for simple data types:
  > records = LOAD 'input/data' AS (id:int, date:chararray);


Statements and aliases

 Each statement, defining a data processing operator /
  relation, produces a dataset with an alias

grunt> records = LOAD 'input/data' AS (id:int, date:chararray);


 LOAD returns a relation of tuples, whose elements can be referenced by
  position or by name

 Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE

Filtering data

 Filter is used to work with tuples and rows of data
 Select data you want, or remove the data you are not
  interested in

 Filtering early in the processing pipeline minimises the
  amount of data flowing through the system, which can
  improve efficiency

grunt> filtered_records = FILTER records BY id == 234;


Foreach .. Generate

 Foreach .. Generate acts on columns on every row in a
  relation
grunt> ids = FOREACH records GENERATE id;


 Positional reference. This statement has the same output:
grunt> ids = FOREACH records GENERATE $0;


 The elements of ‘ids’ however are not named ‘id’ unless
  you add ‘AS id’ at the end of your statement
grunt> ids = FOREACH records GENERATE $0 AS id;
Grouping and joining

 Group .. by collects the tuples that share a grouping key
  into one output bag per key; the bag keeps the schema of
  the input relation

 Join performs an inner equijoin of two or more relations
  based on common field values.

 You can also perform outer joins using keywords left,
  right and full

 Cogroup is similar to Group, using multiple relations, and
  creates a nested set of output tuples
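
The difference between these three operators is easiest to see on tiny relations. The following is a plain-Python sketch of their semantics, not of Pig's implementation; the relations, field layouts and function names are made up for illustration.

```python
from collections import defaultdict

users = [("alice", 25), ("bob", 31)]                       # (name, age)
visits = [("alice", "a.com"), ("alice", "b.com"),
          ("carol", "c.com")]                              # (user, url)

def group_by(relation, key_index):
    # GROUP: one (key, bag of full tuples) entry per distinct key.
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)
    return dict(bags)

def inner_join(left, right, left_key, right_key):
    # JOIN: concatenate every pair of tuples whose keys are equal.
    right_groups = group_by(right, right_key)
    return [l + r for l in left for r in right_groups.get(l[left_key], [])]

def cogroup(left, right, left_key, right_key):
    # COGROUP: per key, one bag from each relation (possibly empty).
    lg, rg = group_by(left, left_key), group_by(right, right_key)
    return {k: (lg.get(k, []), rg.get(k, []))
            for k in sorted(set(lg) | set(rg))}

print(inner_join(users, visits, 0, 0))
print(cogroup(users, visits, 0, 0)["carol"])
```

The inner join drops bob and carol because they have no match, whereas the cogroup keeps every key and leaves the missing side as an empty bag.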
Ordering, combining, splitting…

 Order imposes an order on the output to sort a relation
  by one or more fields
 The Limit statement limits the number of results
 Split partitions a relation into two or more relations
 The Sample operator selects a random data sample with
  the stated sample size
 The Union operator merges the contents of two or more
  relations

Stream

 The Stream operator lets you transform data in a
  relation using an external program or script

grunt> C = STREAM A THROUGH `cut -f 2`;
   Extract the second field of A using cut

 The scripts are shipped to the cluster using
  grunt> DEFINE script `script.py` SHIP ('script.py');
  grunt> D = STREAM C THROUGH script AS (…);


User defined functions

 Pig supports user-defined functions (UDFs), with an active
  community behind them

 UDFs can encapsulate users' processing logic in filtering,
  comparison, evaluation, grouping, or storage
   filter functions for instance are all subclasses of
    FilterFunc, which itself is a subclass of EvalFunc

 PiggyBank: the Pig community sharing their UDFs
 DataFu: LinkedIn's collection of Pig UDFs


A simple eval UDF example

package myudfs;

import …

public class UPPER extends EvalFunc<String>
{
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0)
       return null;
    try{
       String str = (String) input.get(0);
       return str.toUpperCase();
    }catch(Exception e){
       throw WrappedIOException.wrap("Caught exception processing input row ", e);
    }
  }
}

An Example

Let’s find the top 5 most visited pages by users aged 18 – 25.
  Input: user data file, and page view data file.

 [Figure: dataflow. Load Users, then Filter by age; Load Pages; both feed Join on name, followed by Group on url, Count clicks, Order by clicks, Take top 5]
A simple script

Users    = LOAD 'users' AS (name, age);
Filtered = FILTER Users BY
                  age >= 18 AND age <= 25;
Pages    = LOAD 'pages' AS (user, url);
Joined   = JOIN Filtered BY name, Pages BY user;
Grouped  = GROUP Joined BY url;
Summed   = FOREACH Grouped GENERATE group,
                   COUNT(Joined) AS clicks;
Sorted   = ORDER Summed BY clicks DESC;
Top5     = LIMIT Sorted 5;

STORE Top5 INTO 'top5sites';
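
For checking the script's logic on a small sample, the same dataflow can be written in a few lines of Python. This is only a local sanity-check sketch: the sample records and the function name `top_pages` are invented, and using a set for the join assumes each user name appears once in the users data.

```python
from collections import Counter

def top_pages(users, pages, n=5, min_age=18, max_age=25):
    # FILTER Users BY age >= 18 AND age <= 25
    selected = {name for name, age in users if min_age <= age <= max_age}
    # JOIN on name = user, GROUP BY url, COUNT
    clicks = Counter(url for user, url in pages if user in selected)
    # ORDER BY clicks DESC, LIMIT n
    return clicks.most_common(n)

users = [("amy", 19), ("bob", 40), ("cat", 25)]
pages = [("amy", "/home"), ("cat", "/home"), ("bob", "/home"),
         ("amy", "/news"), ("dan", "/news")]
print(top_pages(users, pages))
```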

In MapReduce!

 [Figure: the same top-5 query written directly against the Java MapReduce API; the code fills the entire slide]
                                                                                                                                                                                                                                                                                             a
                                                                                                                                                                                                                                                                                             n

                                                                                                                                                                                                                                                                                     e t O u t
                                                                                                                                                                                                                                                                                                 T
                                                                                                                                                                                                                                                                                                 s
                                                                                                                                                                                                                                                                                                 d
                                                                                                                                                                                                                                                                                                 p
                                                                                                                                                                                                                                                                                                     e
                                                                                                                                                                                                                                                                                                     (
                                                                                                                                                                                                                                                                                                     P
                                                                                                                                                                                                                                                                                                     u
                                                                                                                                                                                                                                                                                                         x
                                                                                                                                                                                                                                                                                                         T
                                                                                                                                                                                                                                                                                                         a
                                                                                                                                                                                                                                                                                                         t
                                                                                                                                                                                                                                                                                                             t
                                                                                                                                                                                                                                                                                                             e
                                                                                                                                                                                                                                                                                                             g
                                                                                                                                                                                                                                                                                                             P
                                                                                                                                                                                                                                                                                                                 .
                                                                                                                                                                                                                                                                                                                 x
                                                                                                                                                                                                                                                                                                                 e
                                                                                                                                                                                                                                                                                                                 a
                                                                                                                                                                                                                                                                                                                     c
                                                                                                                                                                                                                                                                                                                     t
                                                                                                                                                                                                                                                                                                                     s
                                                                                                                                                                                                                                                                                                                     t

                                                                                                                                                                                                                                                                                                         p u t P a t h
                                                                                                                                                                                                                                                                                                                         l
                                                                                                                                                                                                                                                                                                                         .
                                                                                                                                                                                                                                                                                                                         .
                                                                                                                                                                                                                                                                                                                         h
                                                                                                                                                                                                                                                                                                                             a
                                                                                                                                                                                                                                                                                                                             c
                                                                                                                                                                                                                                                                                                                             c
                                                                                                                                                                                                                                                                                                                             (
                                                                                                                                                                                                                                                                                                                                 s
                                                                                                                                                                                                                                                                                                                                 l
                                                                                                                                                                                                                                                                                                                                 l
                                                                                                                                                                                                                                                                                                                                 l


import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }
    public static class LoadAndFilterUsers extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }
    public static class LoadJoined extends MapReduceBase
        implements Mapper<Text, Text, Text, LongWritable> {

        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma);
            String key = line.substring(firstComma, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }
    public static class ReduceUrls extends MapReduceBase
        implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see

            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }

                new Path("/user/gates/tmp/
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MREx
        lfu.setJobName("Load and Filter Users
        lfu.setInputFormat(TextInputFo
        lfu.setOutputKeyClass(Text.cla
        lfu.setOutputValueClass(Text.c
        lfu.setMapperClass(LoadAndFilt
        FileInputFormat.addInputPath(lfu, new
                Path("/user/gates/users"));
        FileOutputFormat.setOutputPath
                new Path("/user/gates/tmp/
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users an
        join.setInputFormat(KeyValueTe
        join.setOutputKeyClass(Text.cl
        join.setOutputValueClass(Text.
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.clas
        FileInputFormat.addInputPath(j
                Path("/user/gates/tmp/indexed_pages"))
        FileInputFormat.addInputPath(j
                Path("/user/gates/tmp/filtered_users")
        FileOutputFormat.setOutputPath(join, new
                Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPa
        joinJob.addDependingJob(loadUs

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs")
        group.setInputFormat(KeyValueT
        group.setOutputKeyClass(Text.c
        group.setOutputValueClass(Long
        group.setOutputFormat(SequenceFileOutputFormat.cla
        group.setMapperClass(LoadJoine
        group.setCombinerClass(ReduceU
                              p u b l i c   v o i d   m a p ( L o n g W r i t                                         a b l e   k     ,   T e x t   v a l ,       o c . c o l l e c t ( k e y ,   n e w   L o n g W r i t a b l e ( s u mg r o u p . s e t R e d u c e r C l a s s ( R e d u c e U r
                                                                                                                                                                                                                                          ) ) ;
                                  O u t p u t C o l l e c t o r < T e x t ,                                           T e x t >       o c ,               }                                                                               F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( g
                                              R e p o r t e r   r e p o r t e                                         r )   t h r     o w s   I O} x c e p t i o n
                                                                                                                                                  E                   {                                                   P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ;
                                      / /   P u l l   t h e   k e y   o u t                                                                       p u b l i c   s t a t i c   c l a s s   L o a d C l i c k s   e x t e n d s   M a p R e d u c e B a s e o r m a t . s e t O u t p u t P a t h ( g r
                                                                                                                                                                                                                                     F i l e O u t p u t F
                                      S t r i n g   l i n e   =   v a l . t o                                         S t r i n g     ( ) ;        m p l e m e n t s
                                                                                                                                                          i            M a p p e r < W r i t a b l e C o m p a r a b l e , aW r i t a b l e , /L o n g W r i t a b l e , u p e d " ) ) ;
                                                                                                                                                                                                                          P   t h ( " / u s e r   g a t e s / t m p / g r o
                                      i n t   f i r s t C o m m a   =   l i n                                         e . i n d e     x OT e x t > ){
                                                                                                                                          f ( ' , '   ;                                                                                   g r o u p . s e t N u m R e d u c e T a s k s ( 5 0 ) ;
                                      S t r i n g   v a lf i r s t C o m m a
                                                          u e   =   l i n e .                                         s+ b1 ) ;
                                                                                                                        u   s t r     i n g (                                                                                             J o b   g r o u p J o b   =   n e w   J o b ( g r o u p ) ;
                                      i n t   a g e   =   I n t e g e r . p a                                         r s e I n t     ( v a l u e ) ;     p u b l i c   v o i d   m a p (                                                 g r o u p J o b . a d d D e p e n d i n g J o b ( j o i n J
                                      i f   ( a g e   <   1 8   | |   a g e                                           >   2 5 )       r e t u r n ;                       W r i t a b l e C o m p a r a b l e   k e y ,
                                      S t r i n g   k e y   =   l i n e . s u                                         b s t r i n     g ( 0 ,   f i r s t C o m m a ) ;   W r i t a b l e   v a l ,                                       J o b C o n f   t o p 1 0 0   =   n e w   J o b C o n f ( M
                                      T e x t   o u t K e y   =   n e w   T e                                         x t ( k e y     ) ;                                 O u t p u t C o l l e c t o r < L o n g W r i t a b l e ,   Tt o p 1 0 0 . s e t J o b N a m e ( " T o p
                                                                                                                                                                                                                                        e x t >   o c ,                               1 0 0   s i t e
                                      / /   P r e p e n d   a n ei n d e x
                                                                    k n o w                                           t o it h e
                                                                                                                      w h   c h       fv a l u e
                                                                                                                                        i l e       s o   w               R et h r o w s
                                                                                                                                                                              p o   t e r   I O E x c e p t i o n
                                                                                                                                                                                            r e p o r t e r )       {                     t o p 1 0 0 . s e t I n p u t F o r m a t ( S e q u e n c e
                                      / /   i t   c a m e   f r o m .                                                                                             o c . c o l l e c t ( ( L o n g W r i t a b l e ) v a l ,   ( T e x t )t o p 1 0 0 . s e t O u t p u t K e y C l a s s ( L o n g W
                                                                                                                                                                                                                                          k e y ) ;
                                      T e x t   o u t V a l   =   n e w   T e                                         x t ( " 2 "       +   v a l u e ) ;}                                                                                t o p 1 0 0 . s e t O u t p u t V a l u e C l a s s ( T e x
                                      o c . c o l l e c t ( o u t K e y ,   o                                         u t V a l )     ;           }                                                                                       t o p 1 0 0 . s e t O u t p u t F oo r m a t . c l a s s ) ;
                                                                                                                                                                                                                                                                              r m a t ( S e q u e n c
                              }                                                                                                                   p u b l i c   s t a t i c   c l a s s   L i m i t C l i c k s   e x t e n d s   M a p Rt o p 1 0 0 . s e t M a p p e r C l a s s ( L o a d C l i c
                                                                                                                                                                                                                                          e d u c e B a s e
                }                                                                                                                                         i m p l e m e n t s   R e d u c e r < L o n g W r i t a b l e ,   T e x t ,   Lt o p 1 0 0 . s e t C o m b i n e r C l a s s ( L i m i t C
                                                                                                                                                                                                                                          o n g W r i t a b l e ,   T e x t >   {
                p u b l i c   s t a t i c   c l a s s   J o i n   e x t e n d s   M                                                   a p R e d u c e B a s e                                                                             t o p 1 0 0 . s e t R e d u c e r C l a s s ( L i m i t C l
                        i m p l e m e n t s   R e d u c e r < T e x t ,   T e x t ,                                                     T e x t ,   T e xi n t {c o u n t
                                                                                                                                                          t >                 =   0 ;                                                     F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( t
                                                                                                                                                          p u b l i c e d u c e (
                                                                                                                                                          v o i d   r                                                     P a t h ( " / u s e r / g a t e s / t m p / g r o u p e d " ) ) ;
                              p u b l i c                 v o     i   d   r e         d   u c e ( T       e   x t   k e y    ,                                    L o n g W r i t a b l e   k e y ,                                F i l e O u t p u t F o r m a t . s e t O u t p u t P a t h ( t o p
                                                            I     t   e r a t         o   r < T e x       t   >   i t e r    ,                                    I t e r a t o r < T e x t >   i t e r ,                 P a t h ( " / u s e r / g a t e s / t o p 1 0 0 s i t e s f o r u s e r s 1
                                                            O     u   t p u t         C   o l l e c       t   o r < T e x    t ,      T e x t >   o c ,           O u t p u t C o l l e c t o r < L o n g W r i t a b l e ,   T e x t >   t o p 1 0 0 . s e t N u m R e d u c e T a s k s ( 1 ) ;
                                                                                                                                                                                                                                          o c ,
                                                            R     e   p o r t         e   r   r e p       o   r t e r )      t h r    o w s   I O E x c e p t i oR e p o r t e r
                                                                                                                                                                  n   {             r e p o r t e r )   t h r o w s   I O E x c e p t i oJ o b
                                                                                                                                                                                                                                          n   {   l i m i t   =   n e w   J o b ( t o p 1 0 0 ) ;
                                              / /         F o     r     e a c         h     v a l u       e   ,   f i g u    r e      o u t   w h i c h   f i l e   i t ' s   f r o m   a n d                                             l i m i t . a d d D e p e n d i n g J o b ( g r o u p J o b
s t o r e               i t                                                                                                                                       / /   O n l y   o u t p u t   t h e   f i r s t   1 0 0   r e c o r d s
                                              / /   a c c o r d i n g l y .                                                                                       w< i1 0 0 (& & ui t e r . h a s N e x t ( ) )
                                                                                                                                                                    h   l e     c o   n t                             {                   J o b C o n t r o l   j c   =   n e1 0 0 os i t e s rf o r
                                                                                                                                                                                                                                                                              w   J   b C o n t   o l
                                              L i s t < S t r i n g >   f i r s t   =   n e w   A                                     r r a y L i s t < S t r i n g > ( )o c . c o l l e c t ( k e y ,
                                                                                                                                                                          ;                               i t e r . n e x1 8 )t o
                                                                                                                                                                                                                          t (   ) ;   2 5 " ) ;
                                              L i s t < S t r i n g >   s e c o n d   =   n e w                                       A r r a y L i s t < S t r i n g > (c o u n t + + ;
                                                                                                                                                                          ) ;                                                             j c . a d d J o b ( l o a d P a g e s ) ;
                                                                                                                                                                  }                                                                       j c . a d d J o b ( l o a d U s e r s ) ;
                        w h i l e                                     ( i t       e   r .   h a   s   N   e x t   (   )   )   {                           }                                                                               j c . a d d J o b ( j o i n J o b ) ;
                                T                                 e   x t         t     =     i   t   e   r . n   e   x   t ( ) ;                 }                                                                                       j c . a d d J o b ( g r o u p J o b ) ;
                                S                                 t   S t r
                                                                      r i n       i
                                                                                  g   n g
                                                                                        v   (
                                                                                            a )
                                                                                              l   ;
                                                                                                  u   e     =     t   .   t o                     p u b l i c   s t a t i c   v o i d   m a i n ( S t r i n g [ ]   a r g s )   t h r o wj c . a d d J o b ( l i m i t ) ;
                                                                                                                                                                                                                                          s   I O E x c e p t i o n   {
                                i                                 f     ( v       a   l u   e .   c   h   a r A   t   (   0 )   = =     ' 1 ' )           J o b C o n f   l p   =   n e w   J o b C o n f ( M R E x a m p l e . c l a s sj c . r u n ( ) ;
                                                                                                                                                                                                                                          ) ;
f i r s t . a d d ( v a l u e . s                                 u   b s t       r   i n   g (   1   )   ) ;                                          t J o b N a m e ( " L o a d
                                                                                                                                                          l p . s e                  P a g e s " ) ;                              }
                                e                                 l   s e         s   e c   o n   d   .   a d d   ( v a l u e .       s u b s t r i n g (l p . s e t I n p u t F o r m a t ( T e x t I n p u t F o r m a} . c l a s s ) ;
                                                                                                                                                          1 ) ) ;                                                         t
                                                                                                                                                                          46
Ease of Translation

Dataflow step                          Pig Latin statement
Load Users                             Users = LOAD …
Filter by age                          Filtered = FILTER …
Load Pages                             Pages = LOAD …
Join on name                           Joined = JOIN …
Group on url                           Grouped = GROUP …
Count clicks                           Summed = … COUNT()…
Order by clicks                        Sorted = ORDER …
Take top 5                             Top5 = LIMIT …
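
To make the one-box-to-one-operation mapping concrete, here is a minimal in-memory sketch of the same dataflow in plain Java (no Hadoop or Pig involved). The class `TopSitesSketch`, the `{name, age}` and `{name, url}` record layouts and the sample data are all hypothetical; the 18 to 25 age range is borrowed from the MapReduce listing.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the dataflow; not the Pig or MapReduce implementation.
final class TopSitesSketch {

    // users: {name, age} pairs; pages: {name, url} pairs (layout assumed for this sketch).
    static List<String> topUrls(List<String[]> users, List<String[]> pages, int limit) {
        // Load Users + Filter by age (18 to 25, as in the MapReduce listing)
        Set<String> filtered = users.stream()
                .filter(u -> {
                    int age = Integer.parseInt(u[1]);
                    return age >= 18 && age <= 25;
                })
                .map(u -> u[0])
                .collect(Collectors.toSet());

        // Load Pages + Join on name (user names assumed unique, so a membership
        // test is enough), then Group on url + Count clicks
        Map<String, Long> clicks = pages.stream()
                .filter(p -> filtered.contains(p[0]))
                .collect(Collectors.groupingBy(p -> p[1], Collectors.counting()));

        // Order by clicks (descending, ties broken by url) + Take top N
        return clicks.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .limit(limit)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
                new String[] {"alice", "22"},
                new String[] {"bob", "40"},
                new String[] {"carol", "19"});
        List<String[]> pages = Arrays.asList(
                new String[] {"alice", "a.com"},
                new String[] {"carol", "a.com"},
                new String[] {"alice", "b.com"},
                new String[] {"bob", "c.com"});
        System.out.println(topUrls(users, pages, 5));
    }
}
```

With the sample data, bob is filtered out by age, so only a.com (two clicks) and b.com (one click) remain, in that order.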

                                            47
The Hadoop/Pig/Cassandra stack
 Cassandra has gained significant integration points
  with Hadoop and its analytics tools
 To achieve Hadoop’s data locality, each Cassandra node
  must also run a tasktracker process, which makes it part of
  the Hadoop cluster. The namenode and jobtracker can reside
  outside of the Cassandra cluster

 Figure: a three-node Cassandra/Hadoop cluster with an
  external namenode / jobtracker



                              48
Hadoop jobs

 Cassandra has a Java source package for Hadoop integration:
  org.apache.cassandra.hadoop

 ColumnFamilyInputFormat extends InputFormat
 ColumnFamilyOutputFormat extends OutputFormat
 ConfigHelper is a helper class to configure Cassandra-specific
  information

 Hadoop output streaming was introduced in Cassandra 0.7 but
  removed in 0.8

                              49
Pig alongside Cassandra

 The Pig integration, CassandraStorage() (a LoadFunc
  implementation), allows Pig to load data from and store data to
  Cassandra
  grunt> rows = LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();

 The pig_cassandra script, shipped with the Cassandra source,
  performs the necessary initialisation (the Pig environment
  variables still need to be set)

 Pygmalion is a set of scripts and UDFs to facilitate the use of Pig
  alongside Cassandra
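
The location string in the LOAD example names a keyspace and then a column family. The sketch below only illustrates that decomposition in plain Java; the class `CassandraLocationSketch` is hypothetical and is not how CassandraStorage itself is implemented.

```java
// Hypothetical helper: splits a location such as cassandra://Keyspace/cf
// into its keyspace and column family parts.
final class CassandraLocationSketch {
    final String keyspace;
    final String columnFamily;

    private CassandraLocationSketch(String keyspace, String columnFamily) {
        this.keyspace = keyspace;
        this.columnFamily = columnFamily;
    }

    static CassandraLocationSketch parse(String location) {
        String prefix = "cassandra://";
        if (!location.startsWith(prefix)) {
            throw new IllegalArgumentException("not a cassandra:// location: " + location);
        }
        // What follows the scheme is expected to be exactly <keyspace>/<column family>
        String[] parts = location.substring(prefix.length()).split("/");
        if (parts.length != 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("expected cassandra://<keyspace>/<column family>");
        }
        return new CassandraLocationSketch(parts[0], parts[1]);
    }

    public static void main(String[] args) {
        CassandraLocationSketch loc = parse("cassandra://Keyspace/cf");
        System.out.println(loc.keyspace + " / " + loc.columnFamily);
    }
}
```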

                                 50
Workflow

 A workflow system provides an infrastructure to set up and
  manage a sequence of interdependent jobs or sets of jobs

 The Hadoop ecosystem includes a set of workflow tools to
  run applications over MapReduce processes or high-level
  languages
   Cascading (http://www.cascading.org/). A Java library defining data
    processing workflows and rendering them to MapReduce jobs
   Oozie (http://yahoo.github.com/oozie/)
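
The core idea of a workflow system, running each job only once the jobs it depends on have finished, can be sketched in a few lines of plain Java. This is a toy illustration, not how Cascading or Oozie work internally: the class `WorkflowSketch` and its methods are hypothetical, and the job names are taken from the MapReduce listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical mini workflow: runs named jobs once all the jobs they depend on have run.
final class WorkflowSketch {
    private final Map<String, Runnable> jobs = new LinkedHashMap<>();
    private final Map<String, Set<String>> dependencies = new LinkedHashMap<>();

    void addJob(String name, Runnable body, String... dependsOn) {
        jobs.put(name, body);
        dependencies.put(name, new LinkedHashSet<>(Arrays.asList(dependsOn)));
    }

    // Runs every job in dependency order and returns the order that was used.
    List<String> run() {
        List<String> done = new ArrayList<>();
        while (done.size() < jobs.size()) {
            boolean progressed = false;
            for (String name : jobs.keySet()) {
                if (!done.contains(name) && done.containsAll(dependencies.get(name))) {
                    jobs.get(name).run();
                    done.add(name);
                    progressed = true;
                }
            }
            if (!progressed) {
                // No runnable job left although some are pending
                throw new IllegalStateException("cyclic or missing dependency");
            }
        }
        return done;
    }

    public static void main(String[] args) {
        WorkflowSketch wf = new WorkflowSketch();
        // Jobs are registered out of order on purpose
        wf.addJob("limit", () -> System.out.println("running limit"), "group");
        wf.addJob("group", () -> System.out.println("running group"), "join");
        wf.addJob("join", () -> System.out.println("running join"), "loadPages", "loadUsers");
        wf.addJob("loadPages", () -> System.out.println("running loadPages"));
        wf.addJob("loadUsers", () -> System.out.println("running loadUsers"));
        System.out.println(wf.run());
    }
}
```

A real workflow tool adds what this sketch leaves out: running independent jobs in parallel, retrying failures, and scheduling.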

                                  51
Some links

 http://hadoop.apache.org
 http://pig.apache.org/
 https://cwiki.apache.org/confluence/display/PIG/Index
 PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank
 DataFu: https://github.com/linkedin/datafu
 Pygmalion: https://github.com/jeromatron/pygmalion
 http://code.google.com/edu/parallel/mapreduce-tutorial.html
 Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
 Interesting papers:
   http://bit.ly/rskJho - Original MapReduce paper
   http://bit.ly/KvFXxT - Pig paper: ‘Building a High-Level Dataflow System on top of
    MapReduce: The Pig Experience’

                                          52
A simple data flow

 Load checkins data
 Keep only the two ids
 Group by user/loc id & Order
 Limit to top 50

 Top 50 users / locations
  [same script, different group key]


                               53
Another data flow

 Load checkins data
 Split_date
 Group by date
 Count the tuples

 All the checkins, over weeks
 Group by weeks using Stream
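
The steps of this flow can be sketched in memory with plain Java. Several things here are assumptions: check-ins are represented only by an ISO-8601 timestamp string, the class `CheckinsByDateSketch` and its methods are hypothetical, and the weekly roll-up is computed directly in Java rather than by streaming the data through an external script as the slide suggests.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical in-memory version of the flow: split the date, group by date, count, roll up by week.
final class CheckinsByDateSketch {

    // Split_date: keep only the date part of a timestamp such as 2012-03-05T18:04:11
    static LocalDate splitDate(String timestamp) {
        return LocalDate.parse(timestamp.substring(0, 10));
    }

    // Group by date + Count the tuples
    static Map<LocalDate, Long> countByDate(List<String> timestamps) {
        return timestamps.stream()
                .collect(Collectors.groupingBy(
                        CheckinsByDateSketch::splitDate, TreeMap::new, Collectors.counting()));
    }

    // Group by weeks: sum the daily counts per ISO week (label format is an assumption)
    static Map<String, Long> countByWeek(Map<LocalDate, Long> perDay) {
        Map<String, Long> perWeek = new TreeMap<>();
        for (Map.Entry<LocalDate, Long> e : perDay.entrySet()) {
            LocalDate d = e.getKey();
            String week = d.get(IsoFields.WEEK_BASED_YEAR) + "-W"
                    + String.format("%02d", d.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR));
            perWeek.merge(week, e.getValue(), Long::sum);
        }
        return perWeek;
    }

    public static void main(String[] args) {
        List<String> checkins = Arrays.asList(
                "2012-03-05T18:04:11",
                "2012-03-05T20:00:00",
                "2012-03-07T09:00:00",
                "2012-03-12T08:30:00");
        Map<LocalDate, Long> perDay = countByDate(checkins);
        System.out.println(perDay);
        System.out.println(countByWeek(perDay));
    }
}
```

With the sample data, the daily counts are 2, 1 and 1, and the weekly roll-up is `{2012-W10=3, 2012-W11=1}`.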


                     54

More Related Content

What's hot

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoopjoelcrabb
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questionsKalyan Hadoop
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigMilind Bhandarkar
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorialawesomesos
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGAdam Kawa
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questionsKalyan Hadoop
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reducerantav
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To HadoopAdeel Ahmad
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easyVictor Sanchez Anguix
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsAsad Masood Qazi
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010ragho
 

What's hot (20)

Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop interview questions
Hadoop interview questionsHadoop interview questions
Hadoop interview questions
 
Practical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & PigPractical Problem Solving with Apache Hadoop & Pig
Practical Problem Solving with Apache Hadoop & Pig
 
Hadoop Tutorial
Hadoop TutorialHadoop Tutorial
Hadoop Tutorial
 
Introduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUGIntroduction To Apache Pig at WHUG
Introduction To Apache Pig at WHUG
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Apache pig
Apache pigApache pig
Apache pig
 
Hadoop Interview Questions and Answers
Hadoop Interview Questions and AnswersHadoop Interview Questions and Answers
Hadoop Interview Questions and Answers
 
Hadoop hdfs interview questions
Hadoop hdfs interview questionsHadoop hdfs interview questions
Hadoop hdfs interview questions
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815Running R on Hadoop - CHUG - 20120815
Running R on Hadoop - CHUG - 20120815
 
Apache PIG
Apache PIGApache PIG
Apache PIG
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Another Intro To Hadoop
Another Intro To HadoopAnother Intro To Hadoop
Another Intro To Hadoop
 
Apache Pig: Making data transformation easy
Apache Pig: Making data transformation easyApache Pig: Making data transformation easy
Apache Pig: Making data transformation easy
 
Hadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questionsHadoop 31-frequently-asked-interview-questions
Hadoop 31-frequently-asked-interview-questions
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 

Viewers also liked

Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache PigTapan Avasthi
 
Introduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsIntroduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsSkillspeed
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonCaserta
 
Introduction to Hadoop and Pig
Introduction to Hadoop and PigIntroduction to Hadoop and Pig
Introduction to Hadoop and Pigprash1784
 
Facebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and HadoopFacebooks Petabyte Scale Data Warehouse using Hive and Hadoop
Facebooks Petabyte Scale Data Warehouse using Hive and Hadooproyans
 
Apache Pig: A big data processor
Apache Pig: A big data processorApache Pig: A big data processor
Apache Pig: A big data processorTushar B Kute
 
Pig, Making Hadoop Easy
Pig, Making Hadoop EasyPig, Making Hadoop Easy
Pig, Making Hadoop EasyNick Dimiduk
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
HIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopHIVE: Data Warehousing & Analytics on Hadoop
HIVE: Data Warehousing & Analytics on HadoopZheng Shao
 
Hive Quick Start Tutorial
Hive Quick Start TutorialHive Quick Start Tutorial
Hive Quick Start TutorialCarl Steinbach
 
Integration of Hive and HBase
Integration of Hive and HBaseIntegration of Hive and HBase
Integration of Hive and HBaseHortonworks
 
Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Hadoop, Pig, and Twitter (NoSQL East 2009)
Hadoop, Pig, and Twitter (NoSQL East 2009)Kevin Weil
 
Seminar Presentation Hadoop
Seminar Presentation HadoopSeminar Presentation Hadoop
Seminar Presentation HadoopVarun Narang
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Viewers also liked (18)

Introduction to Apache Pig
Introduction to Apache PigIntroduction to Apache Pig
Introduction to Apache Pig
 
Introduction to Pig | Pig Architecture | Pig Fundamentals
Introduction to Pig | Pig Architecture | Pig FundamentalsIntroduction to Pig | Pig Architecture | Pig Fundamentals
Hadoop pig

  • 1. The analytics stack Hadoop & Pig
  • 2. Outline of the presentation  Hadoop  Motivations. What is it? And high-level concepts  The Ecosystem. The MapReduce model & framework and HDFS  Programming with Hadoop  Pig  What is it? Motivations  Model & components  Integration with Cassandra 2
  • 3. Please interrupt and ask questions! 3
  • 4. Traditional HPC systems  CPU-intensive computations  Relatively small amount of data  Tightly-coupled applications  Highly concurrent I/O requirements  Complex message passing paradigms such as MPI, PVM…  Developers might need to spend some time designing for failure 4
  • 5. Challenges  Data and storage  Locality, computation close to the data  In large-scale systems, nodes fail  Mean time between failures: 1 node / 3 years, 1000 nodes / 1 day  Built-in fault-tolerance  Distributed programming is complex  Need a simple data-parallel programming model. Users would structure the application in high-level functions, the system distributes the data & jobs and handles communications and faults 5
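The failure figures on the Challenges slide follow from simple arithmetic. The sketch below is only a back-of-the-envelope check; it assumes nodes fail independently at a constant rate, which the slide does not state, and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the slide's failure-rate figures.
# Assumption (not from the slides): nodes fail independently at a constant rate.
DAYS_PER_YEAR = 365


def cluster_mtbf_days(node_mtbf_years, nodes):
    """Expected time, in days, between failures anywhere in the cluster."""
    return node_mtbf_years * DAYS_PER_YEAR / nodes


single_node = cluster_mtbf_days(3, 1)      # one machine on its own
large_cluster = cluster_mtbf_days(3, 1000)  # a 1000-node cluster

print("1 node: one failure every %.0f days" % single_node)
print("1000 nodes: one failure every %.1f days" % large_cluster)
```

With 1000 nodes the expected gap between failures shrinks from about three years to roughly one day, which is why fault tolerance has to be built into the framework.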
  • 6. What requirements  A simple data-parallel programming model, designed for high scalability and resiliency  Scalability to large-scale data volumes  Automated fault-tolerance at application level rather than relying on high-availability hardware  Simplified I/O and tasks monitoring  All based on cost-efficient commodity machines (cheap, but unreliable), and commodity network 6
  • 7. Hadoop’s core concepts  Data spread in advance, persistent (in terms of locality), and replicated  No inter-dependencies / shared nothing architecture  Applications written in two pieces of code  And developers do not have to worry about the underlying issues in networking, jobs interdependencies, scheduling, etc… 7
  • 8. Where does it come from?  Hadoop originated from Apache Nutch, an open source web search engine  After the publication of the GFS and MapReduce papers, in 2003 & 2004, the Nutch developers decided to implement open source versions  In February 2006, it became Hadoop, with a dedicated team at Yahoo!  September 2007 - release 0.14.1  Last release 1.0.3 out last week  Used by a large number of companies including Facebook, LinkedIn, Twitter and Hulu, among many others
  • 9. The model  A map function processes a key/value pair to generate a set of intermediate key/value pairs  Divides the problem into smaller ‘intermediate key/value’ pairs  The reduce function merges all intermediate values associated with the same intermediate key  The run-time system takes care of:  Partitioning the input data across nodes (blocks/chunks typically of 64 MB to 128 MB)  Scheduling the data and execution. Maps operate on a single block.  Managing node failures, replication, re-submissions
  • 10. Simple Word Count
      # key: offset, value: line
      def mapper():
          for line in open("doc"):
              for word in line.split():
                  output(word, 1)

      # key: a word, value: iterator over counts
      def reducer():
          output(key, sum(values))
  • 11. The Combiner  A combiner is a local aggregation function for repeated keys produced by the map  Works for associative and commutative functions like sum, count, max  Decreases the size of intermediate data / communications  Map-side aggregation for word count: def combiner(): output(key, sum(values))
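To make the map, combine and reduce steps concrete, here is a minimal single-process sketch of word count in plain Python. It is not Hadoop code: the `shuffle` helper stands in for the grouping the framework performs between the two phases, and every function name is illustrative rather than part of any Hadoop API.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word, 1


def combiner(pairs):
    """Locally sum the counts of repeated keys produced by one map task."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Group intermediate values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Merge all values that share the same intermediate key."""
    return key, sum(values)


def word_count(splits, use_combiner=True):
    intermediate = []
    for split in splits:  # one simulated map task per input split
        pairs = [kv for line in split for kv in mapper(line)]
        if use_combiner:
            pairs = combiner(pairs)
        intermediate.extend(pairs)
    counts = dict(reducer(k, v) for k, v in shuffle(intermediate).items())
    return counts, len(intermediate)


splits = [["to be or not to be"], ["to do is to be"]]
with_combiner, sent_with = word_count(splits, use_combiner=True)
without_combiner, sent_without = word_count(splits, use_combiner=False)
print(with_combiner == without_combiner, sent_with, sent_without)
```

The result is the same either way; the combiner only reduces the number of intermediate pairs that would cross the network, which is the point the slide makes.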
  • 12. Some other basic examples…  Distributed Grep:  Map function emits a line if it matches a supplied pattern  Reduce function is an identity function that copies the supplied intermediate data to the output  Count of URL accesses:  Map function processes logs of web page requests and outputs <URL, 1>  Reduce function adds together all values for the same URL, emitting <URL, total count> pairs  Reverse Web-Link graph:  Map function outputs <tgt, src> for each link to a tgt in a page named src  Reduce concatenates the list of all src URLS associated with a given tgt URL and emits the pair: <tgt, list(src)>  Inverted Index:  Map function parses each document, emitting a sequence of <word, doc_ID>  Reduce accepts all pairs for a given word and emits a <word, list(doc_ID)> pair 12
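As an illustration of the last example on the slide above, the inverted index can be sketched in a few lines of plain Python. This simulates the two phases in one process; the function names and the choice to lower-case words and de-duplicate document IDs are assumptions made for the sketch, not something the slides specify.

```python
from collections import defaultdict


def map_inverted_index(doc_id, text):
    """Map: emit a (word, doc_id) pair for each word in the document."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_inverted_index(word, doc_ids):
    """Reduce: collect every document ID seen for one word."""
    return word, sorted(set(doc_ids))


def build_index(documents):
    groups = defaultdict(list)  # stands in for the framework's shuffle
    for doc_id, text in documents.items():
        for word, emitted_id in map_inverted_index(doc_id, text):
            groups[word].append(emitted_id)
    return dict(reduce_inverted_index(w, ids) for w, ids in groups.items())


docs = {"d1": "hadoop stores data", "d2": "pig runs on hadoop"}
index = build_index(docs)
print(index["hadoop"])
```

The other examples on the slide follow the same pattern: only the key and value emitted by the map function, and the merge done in the reduce function, change.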
  • 13. Hadoop Ecosystem  Core components, from http://indoos.wordpress.com/2010/08/16/hadoop-ecosystem-world-map/
  • 14. Hadoop components  Hadoop consists of two core components  The MapReduce framework, and  The Hadoop Distributed File System  MapReduce layer  JobTracker  TaskTrackers  HDFS layer  Namenode  Secondary namenode  Datanode  [Figure: example of a typical physical distribution within a Hadoop cluster]
  • 15. HDFS  Scalable and fault-tolerant. Based on Google’s GFS  Single namenode stores metadata (file names, block locations, etc.)  Files split into chunks, replicated across several datanodes (typically 3+). It is rack-aware  Optimised for large files and sequential streaming reads, rather than random access  Files written once, no append  [Figure: the namenode maps File1 to blocks 1 to 4, each replicated on several datanodes]
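The idea of splitting a file into blocks and recording where each replica lives can be sketched as follows. This is a toy model, not HDFS code: the 64 MB block size and replication factor of 3 are taken as typical values, the round-robin placement is a deliberate simplification (real HDFS placement is rack-aware), and all names are hypothetical.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # assumed block size: 64 MB
REPLICATION = 3                # assumed replication factor


def count_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to store a file of file_size bytes."""
    return math.ceil(file_size / block_size)


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Toy round-robin placement; real HDFS also takes racks into account."""
    if replication > len(datanodes):
        raise ValueError("not enough datanodes for the replication factor")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            datanodes[(block + offset) % len(datanodes)]
            for offset in range(replication)
        ]
    return placement


blocks = count_blocks(200 * 1024 * 1024)  # a 200 MB file
namenode_metadata = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
for block, nodes in namenode_metadata.items():
    print(block, nodes)
```

The dictionary plays the role of the namenode's metadata: it holds block locations only, while the data itself would sit on the datanodes.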
  • 16. HDFS  HDFS API / HDFS FS Shell for command line*
      > hadoop fs -copyFromLocal local_dir hdfs_dir
      > hadoop fs -copyToLocal hdfs_dir local_dir
     Tools  Flume: collects, aggregates and moves log data from application servers to HDFS  Sqoop: HDFS import and export to SQL  *http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html
  • 17. MapReduce execution  In Hadoop, a Job (full program) is a set of tasks  Each task (mapper or reducer) is attempted at least once, or multiple times if it crashes. Multiple attempts may also occur in parallel  The tasks run inside a separate JVM on the tasktracker  All the class files are assembled into a jar file, which will be uploaded into HDFS, before notifying the tasktracker 17
  • 18. MapReduce execution  [Figure: a MapReduce job. The master assigns input splits 0 to 4 to map workers; each map worker reads its split and writes intermediate files locally; reduce workers read the intermediate files remotely and write the output files]
  • 19. Getting Started…  Multiple choices - Vanilla Apache version, or one of the numerous existing distros  hadoop.apache.org  www.cloudera.com [A set of VMs is also provided]  http://www.karmasphere.com/  …  Three ways to write jobs in Hadoop:  Java API  Hadoop Streaming (for Python, Perl, etc.)  Pipes API (C++) 19
  • 20. Word Count in Java
      public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");
          conf.setMapperClass(MapClass.class);
          conf.setCombinerClass(ReduceClass.class);
          conf.setReducerClass(ReduceClass.class);
          FileInputFormat.setInputPaths(conf, args[0]);
          FileOutputFormat.setOutputPath(conf, new Path(args[1]));
          conf.setOutputKeyClass(Text.class);
          conf.setOutputValueClass(IntWritable.class);
          JobClient.runJob(conf);
      }
  • 21. Word Count in Java – mapper
      public class MapClass extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable ONE = new IntWritable(1);

          public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer itr = new StringTokenizer(line);
              while (itr.hasMoreTokens()) {
                  // Emit (word, 1) for every token in the line
                  out.collect(new Text(itr.nextToken()), ONE);
              }
          }
      }
  • 22. Word Count in Java – reducer
      public class ReduceClass extends MapReduceBase
              implements Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterator<IntWritable> values,
                  OutputCollector<Text, IntWritable> out,
                  Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              out.collect(key, new IntWritable(sum));
          }
      }
  • 23. Getting keys and values  [Figure: the InputFormat divides the input file into input splits; a RecordReader turns each split into key/value pairs for a Mapper; each Reducer writes its results through the OutputFormat's RecordWriter to an output file]
  • 24. Hadoop Streaming
      Mapper.py:
      #!/usr/bin/env python
      import sys

      # Emit one tab-separated (word, 1) pair per word read from stdin.
      for line in sys.stdin:
          for word in line.split():
              print("%s\t%s" % (word, 1))

      Reducer.py:
      #!/usr/bin/env python
      import sys

      # Sum the counts per word, folding case so "The" and "the" share a key.
      counts = {}
      for line in sys.stdin:
          word, count = line.split("\t", 1)
          word = word.lower()
          counts[word] = counts.get(word, 0) + int(count)
      for word, count in counts.items():
          print("%s\t%s" % (word, count))

      You can locally test your code on the command line:
      $> cat data | mapper | sort | reducer
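The streaming reducer above keeps every word in a dictionary. Because the reducer's input arrives sorted by key (hence the `sort` step in the local test), a variant can sum one key at a time and keep only a single running total in memory. The sketch below shows that alternative; it is not the author's code, and `parse` and `reduce_sorted` are names invented for illustration. In a real streaming job the lines would come from `sys.stdin` instead of the sample list.

```python
from itertools import groupby


def parse(lines):
    """Split each 'word<TAB>count' line into a (word, int) pair."""
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        yield word, int(count)


def reduce_sorted(lines):
    """Sum counts per word; the input must already be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# Sample input, as the sorted mapper output would look.
sample = ["be\t1\n", "be\t1\n", "not\t1\n", "to\t1\n", "to\t1\n"]
result = list(reduce_sorted(sample))
for word, total in result:
    print("%s\t%s" % (word, total))
```

The trade-off is that this version silently produces duplicate keys if the input is not sorted, whereas the dictionary version does not depend on input order.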
  • 25. High-level tools  MapReduce is fairly low-level: must think about keys, values, partitioning, etc.  How to express parallel algorithms by a series of MapReduce jobs  Can be hard to capture common job building blocks  Different use cases require different tools! 25
  • 26. Pig  Apache Pig is a platform raising the level of abstraction for processing large datasets. Its language, Pig Latin, is a simple query algebra expressing data transformations and applying functions to records  [Figure: Pig compiles scripts into MapReduce jobs and submits them to Hadoop / HDFS]  Started at Yahoo! Research, >60% of Hadoop jobs within Yahoo! are Pig jobs
  • 27. Motivations  MapReduce requires a Java programmer  The solution was to abstract it and create a system for users familiar with scripting languages  Other than very trivial applications, MapReduce requires multiple stages, leading to long development cycles  Rapid prototyping. Increased productivity  In MapReduce users have to reinvent common functionality (join, filter, etc.)  Pig provides them
  • 28. Used for  Rapid prototyping of algorithms for processing large datasets  Log analysis  Ad hoc queries across various large datasets  Analytics (including through sampling)  Pig Mix provides a set of performance and scalability benchmarks. Currently 1.1 times MapReduce speed. 28
  • 29. Using Pig  Grunt, the Pig shell  Executing scripts directly  Embedding Pig in Java (using PigServer, similar to SQL using JDBC), or Python  A range of tools including Eclipse plug-ins  PigPen, Pig Editor… 29
  • 30. Execution modes  Pig has two execution types or modes: local mode and Hadoop mode  Local  Pig runs in a single JVM and accesses the local filesystem. Starting from v0.7 it uses the Hadoop job runner.  Hadoop mode  Pig runs on a Hadoop cluster (you need to tell Pig about the version and point it to your Namenode and Jobtracker)
  • 31. Running Pig  Pig resides on the user’s machine and can be independent from the Hadoop cluster  Pig is written in Java and is portable  It compiles scripts into MapReduce jobs and submits them to the cluster  No need to install anything extra on the cluster
  • 32. How does it work  Pig defines a DAG: a step-by-step set of operations, each performing a transformation  Pig defines a logical plan for these transformations, then: • Parses, checks, & optimises • Plans the execution • Maps & Reduces • Passes the jar to Hadoop • Monitors the progress
      A = LOAD 'file' AS (line:chararray);
      B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
      C = GROUP B BY word;
      D = FOREACH C GENERATE group, COUNT(B);
      STORE D INTO 'output';
  • 33. Data types & expressions  Scalar type:  Int, Long, Float, Double, Chararray, Bytearray  Complex type representing nested structures:  Tuple: sequence of fields of any type  Bag: an unordered collection of tuples  Map: a set of key-value pairs. Keys must be atoms, values may be any type  Expressions:  used in Pig as a part of a statement; field name, position ($), arithmetic, conditional, comparison, Boolean, etc. 33
  • 34. Functions  Load / Store  Data loaders; PigStorage, BinStorage, BinaryStorage, TextLoader, PigDump  Evaluation  Many built-in functions MAX, COUNT, SUM, DIFF, SIZE…  Filter  A special type of eval function used by the FILTER operator. IsEmpty is a built-in function  Comparison  Function used in ORDER statement; ASC | DESC 34
  • 35. Schemas  Schemas enable you to associate names and types with the fields in the relation  Schemas are optional but recommended whenever possible; type declarations result in better parse-time error checking and more efficient code execution  They are defined using the AS keyword with operators  Schema definition for simple data types: > records = LOAD 'input/data' AS (id:int, date:chararray);
  • 36. Statements and aliases  Each statement, defining a data processing operator / relation, produces a dataset with an alias grunt> records = LOAD 'input/data' AS (id:int, date:chararray);  LOAD returns a relation (a bag of tuples), whose fields can be referenced by position or by name  Very useful operators are DUMP, ILLUSTRATE, and DESCRIBE
  • 37. Filtering data  Filter is used to work with tuples and rows of data  Select data you want, or remove the data you are not interested in  Filtering early in the processing pipeline minimises the amount of data flowing through the system, which can improve efficiency grunt> filtered_records = FILTER records BY id == 234;
  • 38. Foreach .. Generate  Foreach .. Generate acts on columns on every row in a relation grunt> ids = FOREACH records GENERATE id;  Positional reference. This statement has the same output grunt> ids = FOREACH records GENERATE $0;  The elements of ‘ids’ however are not named ‘id’ unless you add ‘AS id’ at the end of your statement grunt> ids = FOREACH records GENERATE $0 AS id; 38
  • 39. Grouping and joining  Group .. by makes an output bag containing grouped fields with the same schema using a grouping key  Join performs an inner equijoin of two or more relations based on common field values  You can also perform outer joins using the keywords left, right and full  Cogroup is similar to Group, using multiple relations, and creates a nested set of output tuples
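The difference between Group, Join and Cogroup is easier to see on a tiny dataset. The following plain-Python sketch mimics what the three operators return; it is an approximation of their semantics for illustration, not how Pig implements them, and the helper names and sample data are made up.

```python
from collections import defaultdict


def group_by(relation, key):
    """GROUP: one entry per key value, holding the bag of matching tuples."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key]].append(row)
    return dict(bags)


def inner_join(left, left_key, right, right_key):
    """JOIN: concatenate every pair of tuples whose keys are equal."""
    right_bags = group_by(right, right_key)
    return [l + r for l in left for r in right_bags.get(l[left_key], [])]


def cogroup(left, left_key, right, right_key):
    """COGROUP: per key, one bag from each relation (possibly empty)."""
    left_bags = group_by(left, left_key)
    right_bags = group_by(right, right_key)
    keys = sorted(set(left_bags) | set(right_bags))
    return {k: (left_bags.get(k, []), right_bags.get(k, [])) for k in keys}


users = [("alice", 22), ("bob", 31)]
pages = [("alice", "/home"), ("alice", "/news"), ("carol", "/home")]

joined = inner_join(users, 0, pages, 0)
cogrouped = cogroup(users, 0, pages, 0)
print(joined)
print(cogrouped["bob"])
```

The join flattens matches into wide tuples and drops keys found on one side only, while the cogroup keeps every key and preserves the nesting, leaving an empty bag where a relation has no match.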
  • 40. Ordering, combining, splitting…  Order imposes an order on the output to sort a relation by one or more fields  The Limit statement limits the number of results  Split partitions a relation into two or more relations  The Sample operator selects a random data sample with the stated sample size  The Union operator merges the contents of two or more relations
  • 41. Stream  The Stream operator lets you transform data in a relation using an external program or script grunt> C = STREAM A THROUGH `cut -f 2`;  Extract the second field of A using cut  The scripts are shipped to the cluster using grunt> DEFINE script `script.py` SHIP ('script.py'); grunt> D = STREAM C THROUGH script AS (…);
  • 42. User defined functions  Pig supports user-defined functions (UDFs), and has a community sharing them  UDFs can encapsulate users' processing logic in filtering, comparison, evaluation, grouping, or storage  Filter functions, for instance, are all subclasses of FilterFunc, which itself is a subclass of EvalFunc  PiggyBank: the Pig community sharing their UDFs  DataFu: LinkedIn's collection of Pig UDFs
  • 43. A simple eval UDF example
      package myudfs;
      import …
      public class UPPER extends EvalFunc<String> {
          public String exec(Tuple input) throws IOException {
              if (input == null || input.size() == 0)
                  return null;
              try {
                  String str = (String) input.get(0);
                  return str.toUpperCase();
              } catch (Exception e) {
                  throw WrappedIOException.wrap("Caught exception processing input row ", e);
              }
          }
      }
  • 44. An Example  Let’s find the top 5 most visited pages by users aged 18 – 25. Input: user data file, and page view data file.  [Figure: dataflow. Load Users, then Filter by age; Load Pages; Join on name; Group on url; Count clicks; Order by clicks; Take top 5]
  • 45. A simple script
      Users = LOAD 'users' AS (name, age);
      Filtered = FILTER Users BY age >= 18 AND age <= 25;
      Pages = LOAD 'pages' AS (user, url);
      Joined = JOIN Filtered BY name, Pages BY user;
      Grouped = GROUP Joined BY url;
      Summed = FOREACH Grouped GENERATE group, COUNT(Joined) AS clicks;
      Sorted = ORDER Summed BY clicks DESC;
      Top5 = LIMIT Sorted 5;
      STORE Top5 INTO 'top5sites';
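To check what the script above computes, the same dataflow can be replayed on a handful of records in plain Python. This is only an in-memory sketch of the logic: it assumes user names are unique (so the join cannot duplicate clicks), the sample data is invented, and `top_pages` is a hypothetical helper, not part of Pig.

```python
from collections import Counter


def top_pages(users, pages, n=5, min_age=18, max_age=25):
    """Filter users by age, join on name, count clicks per url, keep the top n."""
    # FILTER Users BY age >= 18 AND age <= 25
    eligible = {name for name, age in users if min_age <= age <= max_age}
    # JOIN on name, GROUP BY url, COUNT
    clicks = Counter(url for user, url in pages if user in eligible)
    # ORDER BY clicks DESC, LIMIT n
    return clicks.most_common(n)


users = [("amy", 19), ("bob", 40), ("cat", 25)]
pages = [("amy", "/a"), ("amy", "/b"), ("bob", "/a"), ("cat", "/a")]

top = top_pages(users, pages)
print(top)
```

Each comment marks the Pig Latin statement that the Python line stands in for, which makes it a convenient way to sanity-check a script's expected output on small data before running it on a cluster.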
  • 46. In MapReduce!  [Slide: the same job written directly against the Hadoop Java MapReduce API, as a class MRExample containing mapper and reducer classes such as LoadPages, LoadJoined and ReduceUrls, with separate jobs for loading pages, loading and filtering users, joining and grouping. The listing is illegible in this transcript and is not reproduced here.]
s e t O u t p u t V a l u e C l a s s ( L o n g p u b l i c s t a t i c c l a s s L o a d A n d F i l t e r U s e r s e x t e n d s M a p R er e p o r t e r . s e t S t a t u s ( " O K d u c e B a s e " ) ; g r o u p . s e t O u t p u t F o r m a t ( S e q u e n c e l e O u t p u t F o r m a t . c l a i m p l e m e n t s M a p p e r < L o n g W r i t a b l e , T e x t , T e x t , } e x t > T { g r o u p . s e t M a p p e r C l a s s ( L o a d J o i n e g r o u p . s e t C o m b i n e r C l a s s ( R e d u c e U p u b l i c v o i d m a p ( L o n g W r i t a b l e k , T e x t v a l , o c . c o l l e c t ( k e y , n e w L o n g W r i t a b l e ( s u mg r o u p . s e t R e d u c e r C l a s s ( R e d u c e U r ) ) ; O u t p u t C o l l e c t o r < T e x t , T e x t > o c , } F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( g R e p o r t e r r e p o r t e r ) t h r o w s I O} x c e p t i o n E { P a t h ( " / u s e r / g a t e s / t m p / j o i n e d " ) ) ; / / P u l l t h e k e y o u t p u b l i c s t a t i c c l a s s L o a d C l i c k s e x t e n d s M a p R e d u c e B a s e o r m a t . s e t O u t p u t P a t h ( g r F i l e O u t p u t F S t r i n g l i n e = v a l . t o S t r i n g ( ) ; m p l e m e n t s i M a p p e r < W r i t a b l e C o m p a r a b l e , aW r i t a b l e , /L o n g W r i t a b l e , u p e d " ) ) ; P t h ( " / u s e r g a t e s / t m p / g r o i n t f i r s t C o m m a = l i n e . i n d e x OT e x t > ){ f ( ' , ' ; g r o u p . s e t N u m R e d u c e T a s k s ( 5 0 ) ; S t r i n g v a lf i r s t C o m m a u e = l i n e . s+ b1 ) ; u s t r i n g ( J o b g r o u p J o b = n e w J o b ( g r o u p ) ; i n t a g e = I n t e g e r . p a r s e I n t ( v a l u e ) ; p u b l i c v o i d m a p ( g r o u p J o b . a d d D e p e n d i n g J o b ( j o i n J i f ( a g e < 1 8 | | a g e > 2 5 ) r e t u r n ; W r i t a b l e C o m p a r a b l e k e y , S t r i n g k e y = l i n e . 
s u b s t r i n g ( 0 , f i r s t C o m m a ) ; W r i t a b l e v a l , J o b C o n f t o p 1 0 0 = n e w J o b C o n f ( M T e x t o u t K e y = n e w T e x t ( k e y ) ; O u t p u t C o l l e c t o r < L o n g W r i t a b l e , Tt o p 1 0 0 . s e t J o b N a m e ( " T o p e x t > o c , 1 0 0 s i t e / / P r e p e n d a n ei n d e x k n o w t o it h e w h c h fv a l u e i l e s o w R et h r o w s p o t e r I O E x c e p t i o n r e p o r t e r ) { t o p 1 0 0 . s e t I n p u t F o r m a t ( S e q u e n c e / / i t c a m e f r o m . o c . c o l l e c t ( ( L o n g W r i t a b l e ) v a l , ( T e x t )t o p 1 0 0 . s e t O u t p u t K e y C l a s s ( L o n g W k e y ) ; T e x t o u t V a l = n e w T e x t ( " 2 " + v a l u e ) ;} t o p 1 0 0 . s e t O u t p u t V a l u e C l a s s ( T e x o c . c o l l e c t ( o u t K e y , o u t V a l ) ; } t o p 1 0 0 . s e t O u t p u t F oo r m a t . c l a s s ) ; r m a t ( S e q u e n c } p u b l i c s t a t i c c l a s s L i m i t C l i c k s e x t e n d s M a p Rt o p 1 0 0 . s e t M a p p e r C l a s s ( L o a d C l i c e d u c e B a s e } i m p l e m e n t s R e d u c e r < L o n g W r i t a b l e , T e x t , Lt o p 1 0 0 . s e t C o m b i n e r C l a s s ( L i m i t C o n g W r i t a b l e , T e x t > { p u b l i c s t a t i c c l a s s J o i n e x t e n d s M a p R e d u c e B a s e t o p 1 0 0 . s e t R e d u c e r C l a s s ( L i m i t C l i m p l e m e n t s R e d u c e r < T e x t , T e x t , T e x t , T e xi n t {c o u n t t > = 0 ; F i l e I n p u t F o r m a t . a d d I n p u t P a t h ( t p u b l i c e d u c e ( v o i d r P a t h ( " / u s e r / g a t e s / t m p / g r o u p e d " ) ) ; p u b l i c v o i d r e d u c e ( T e x t k e y , L o n g W r i t a b l e k e y , F i l e O u t p u t F o r m a t . 
s e t O u t p u t P a t h ( t o p I t e r a t o r < T e x t > i t e r , I t e r a t o r < T e x t > i t e r , P a t h ( " / u s e r / g a t e s / t o p 1 0 0 s i t e s f o r u s e r s 1 O u t p u t C o l l e c t o r < T e x t , T e x t > o c , O u t p u t C o l l e c t o r < L o n g W r i t a b l e , T e x t > t o p 1 0 0 . s e t N u m R e d u c e T a s k s ( 1 ) ; o c , R e p o r t e r r e p o r t e r ) t h r o w s I O E x c e p t i oR e p o r t e r n { r e p o r t e r ) t h r o w s I O E x c e p t i oJ o b n { l i m i t = n e w J o b ( t o p 1 0 0 ) ; / / F o r e a c h v a l u e , f i g u r e o u t w h i c h f i l e i t ' s f r o m a n d l i m i t . a d d D e p e n d i n g J o b ( g r o u p J o b s t o r e i t / / O n l y o u t p u t t h e f i r s t 1 0 0 r e c o r d s / / a c c o r d i n g l y . w< i1 0 0 (& & ui t e r . h a s N e x t ( ) ) h l e c o n t { J o b C o n t r o l j c = n e1 0 0 os i t e s rf o r w J b C o n t o l L i s t < S t r i n g > f i r s t = n e w A r r a y L i s t < S t r i n g > ( )o c . c o l l e c t ( k e y , ; i t e r . n e x1 8 )t o t ( ) ; 2 5 " ) ; L i s t < S t r i n g > s e c o n d = n e w A r r a y L i s t < S t r i n g > (c o u n t + + ; ) ; j c . a d d J o b ( l o a d P a g e s ) ; } j c . a d d J o b ( l o a d U s e r s ) ; w h i l e ( i t e r . h a s N e x t ( ) ) { } j c . a d d J o b ( j o i n J o b ) ; T e x t t = i t e r . n e x t ( ) ; } j c . a d d J o b ( g r o u p J o b ) ; S t S t r r i n i g n g v ( a ) l ; u e = t . t o p u b l i c s t a t i c v o i d m a i n ( S t r i n g [ ] a r g s ) t h r o wj c . a d d J o b ( l i m i t ) ; s I O E x c e p t i o n { i f ( v a l u e . c h a r A t ( 0 ) = = ' 1 ' ) J o b C o n f l p = n e w J o b C o n f ( M R E x a m p l e . c l a s sj c . r u n ( ) ; ) ; f i r s t . a d d ( v a l u e . s u b s t r i n g ( 1 ) ) ; t J o b N a m e ( " L o a d l p . s e P a g e s " ) ; } e l s e s e c o n d . a d d ( v a l u e . s u b s t r i n g (l p . 
s e t I n p u t F o r m a t ( T e x t I n p u t F o r m a} . c l a s s ) ; 1 ) ) ; t 46
  • 47. Ease of Translation
      Each step of the data flow maps to one Pig Latin statement:
        Load Users       ->  Users = LOAD …
        Filter by age    ->  Filtered = FILTER …
        Load Pages       ->  Pages = LOAD …
        Join on name     ->  Joined = JOIN …
        Group on url     ->  Grouped = GROUP …
        Count clicks     ->  Summed = … COUNT()…
        Order by clicks  ->  Sorted = ORDER …
        Take top 5       ->  Top5 = LIMIT …
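As a rough illustration of the eight steps above, here is a minimal in-memory sketch in plain Java. It uses neither Hadoop nor Pig. The class name `TopSitesSketch`, the record layout (`{name, age}` and `{user, url}`) and the sample data are assumptions made for illustration; the 18 to 25 age range follows the filter in the MapReduce version.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory version of the data flow:
// load, filter, join, group, count, order, limit.
final class TopSitesSketch {

    // users: {name, age}; pages: {user, url} (assumed layouts).
    // Returns up to `limit` entries, most clicked first, as "url=count".
    static List<String> topSites(List<String[]> users, List<String[]> pages, int limit) {
        // Filter by age: keep the names of users aged 18 to 25.
        Set<String> filtered = new HashSet<>();
        for (String[] u : users) {
            int age = Integer.parseInt(u[1]);
            if (age >= 18 && age <= 25) {
                filtered.add(u[0]);
            }
        }
        // Join on name, group on url and count clicks in one pass.
        Map<String, Long> clicks = new TreeMap<>();
        for (String[] p : pages) {
            if (filtered.contains(p[0])) {
                clicks.merge(p[1], 1L, Long::sum);
            }
        }
        // Order by clicks, descending (ties stay in alphabetical url order).
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(clicks.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        // Take the top `limit`.
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Long> e : sorted.subList(0, Math.min(limit, sorted.size()))) {
            top.add(e.getKey() + "=" + e.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String[]> users = List.of(
                new String[] {"alice", "20"},
                new String[] {"bob", "30"},
                new String[] {"carol", "18"});
        List<String[]> pages = List.of(
                new String[] {"alice", "a.com"},
                new String[] {"alice", "b.com"},
                new String[] {"bob", "a.com"},
                new String[] {"carol", "a.com"},
                new String[] {"carol", "c.com"});
        System.out.println(topSites(users, pages, 5));
    }
}
```

The sketch keeps everything in memory on one machine, which is exactly what MapReduce and Pig avoid; it is only meant to show how little logic the query itself contains.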
  • 48. The Hadoop/Pig/Cassandra stack
      Cassandra has gained some significant integration points with Hadoop and its analytics tools.
      To achieve Hadoop's data locality, Cassandra nodes must be part of the Hadoop cluster by running a tasktracker process. The namenode and jobtracker can then reside outside the Cassandra cluster.
      [Figure: a three-node Cassandra/Hadoop cluster with an external namenode / jobtracker]
  • 49. Hadoop jobs
      Cassandra has a Java source package for Hadoop integration: org.apache.cassandra.hadoop
        ColumnFamilyInputFormat extends InputFormat
        ColumnFamilyOutputFormat extends OutputFormat
        ConfigHelper: a helper class to configure Cassandra-specific information
      Hadoop output streaming was introduced in 0.7 but removed from 0.8
  • 50. Pig alongside Cassandra
      The Pig integration CassandraStorage() (a LoadFunc implementation) allows Pig to load data from and store data to Cassandra:
        grunt> LOAD 'cassandra://Keyspace/cf' USING CassandraStorage();
      The pig_cassandra script, shipped with the Cassandra source, performs the necessary initialisation (the Pig environment variables still need to be set)
      Pygmalion is a set of scripts and UDFs to facilitate the use of Pig alongside Cassandra
  • 51. Workflow
      A workflow system provides an infrastructure to set up and manage a sequence of interdependent jobs / sets of jobs
      The Hadoop ecosystem includes a set of workflow tools to run applications over MapReduce processes or high-level languages
        Cascading (http://www.cascading.org/): a Java library for defining data processing workflows and rendering them to MapReduce jobs
        Oozie (http://yahoo.github.com/oozie/)
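To make the idea of interdependent jobs concrete, here is a minimal sketch in plain Java of a scheduler that only releases a job once all the jobs it depends on have been released. It does not use Cascading, Oozie or Hadoop's JobControl; the class `WorkflowSketch` and its methods are hypothetical, and the job names simply mirror the five jobs of the MapReduce example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical workflow sketch: jobs plus their dependencies,
// resolved into a valid execution order.
final class WorkflowSketch {

    // Job name -> names of the jobs it depends on (insertion order kept).
    private final Map<String, List<String>> dependencies = new LinkedHashMap<>();

    void addJob(String name, String... dependsOn) {
        dependencies.put(name, List.of(dependsOn));
    }

    // Returns an order in which every job comes after all of its dependencies.
    List<String> executionOrder() {
        List<String> order = new ArrayList<>();
        while (order.size() < dependencies.size()) {
            boolean progressed = false;
            for (Map.Entry<String, List<String>> job : dependencies.entrySet()) {
                boolean alreadyScheduled = order.contains(job.getKey());
                boolean ready = order.containsAll(job.getValue());
                if (!alreadyScheduled && ready) {
                    order.add(job.getKey());
                    progressed = true;
                }
            }
            if (!progressed) {
                // Nothing could be scheduled: a cycle or an unknown dependency.
                throw new IllegalStateException("Cyclic or missing dependency");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        WorkflowSketch workflow = new WorkflowSketch();
        workflow.addJob("top100", "group");
        workflow.addJob("group", "join");
        workflow.addJob("join", "loadPages", "loadUsers");
        workflow.addJob("loadPages");
        workflow.addJob("loadUsers");
        System.out.println(workflow.executionOrder());
    }
}
```

A real workflow system adds what this sketch leaves out: running independent jobs in parallel, retrying failures and tracking job state.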
  • 52. Some links
      http://hadoop.apache.org
      http://pig.apache.org/
      https://cwiki.apache.org/confluence/display/PIG/Index
      PiggyBank: https://cwiki.apache.org/confluence/display/PIG/PiggyBank
      DataFu: https://github.com/linkedin/datafu
      Pygmalion: https://github.com/jeromatron/pygmalion
      http://code.google.com/edu/parallel/mapreduce-tutorial.html
      Video tutorials from Cloudera: http://www.cloudera.com/hadoop-training
      Interesting papers:
        http://bit.ly/rskJho - Original MapReduce paper
        http://bit.ly/KvFXxT - Pig paper: ‘Building a High-Level Dataflow System on top of MapReduce: The Pig Experience’
  • 53. A simple data flow
      Goal: top 50 users / locations [same script, different group key]
        Load checkins data
        Keep only the two ids
        Group by user/loc id & Order
        Limit to top 50
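The "same script, different group key" idea can be sketched in plain Java by passing the group key as a function. This is an in-memory illustration only, not the Pig script from the talk. The class `TopCheckinsSketch`, the `{userId, locationId}` record layout and the sample data are assumptions, as is ranking by number of check-ins.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical sketch: one routine, parameterised by the group key.
final class TopCheckinsSketch {

    // Each check-in is already reduced to its two ids: {userId, locationId}.
    // Returns the n most frequent keys with their counts, highest count first.
    static Map<String, Long> top(List<String[]> checkins,
                                 Function<String[], String> groupKey,
                                 int n) {
        // Group by the chosen key and count the check-ins per group.
        Map<String, Long> counts = new TreeMap<>();
        for (String[] checkin : checkins) {
            counts.merge(groupKey.apply(checkin), 1L, Long::sum);
        }
        // Order by count, descending.
        List<Map.Entry<String, Long>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort(Map.Entry.<String, Long>comparingByValue().reversed());
        // Limit to the top n, keeping the sorted order.
        Map<String, Long> top = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : sorted.subList(0, Math.min(n, sorted.size()))) {
            top.put(e.getKey(), e.getValue());
        }
        return top;
    }

    public static void main(String[] args) {
        List<String[]> checkins = List.of(
                new String[] {"u1", "l1"},
                new String[] {"u1", "l2"},
                new String[] {"u2", "l1"},
                new String[] {"u1", "l1"},
                new String[] {"u3", "l1"});
        System.out.println(top(checkins, c -> c[0], 50)); // grouped by user id
        System.out.println(top(checkins, c -> c[1], 50)); // grouped by location id
    }
}
```

Only the key function changes between the two calls, which is the point the slide makes about reusing one script for both rankings.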
  • 54. Another data flow
      Goal: all the checkins, over weeks
        Load checkins data
        Split_date
        Group by date
        Count the tuples
        Group by weeks using Stream
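The weekly aggregation can be approximated in plain Java as below. The talk performs this step in Pig using Stream; this sketch is only a stand-in for the counting logic. It assumes check-in dates arrive as ISO-8601 strings (yyyy-MM-dd) and that weeks follow ISO week numbering; the class `WeeklyCheckinsSketch` and the sample dates are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: count check-ins per ISO week.
final class WeeklyCheckinsSketch {

    // Maps an ISO date such as "2012-01-09" to a week label such as "2012-W02".
    static String isoWeek(String isoDate) {
        LocalDate date = LocalDate.parse(isoDate);
        int year = date.get(IsoFields.WEEK_BASED_YEAR);
        int week = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        return String.format("%d-W%02d", year, week);
    }

    // Groups the dates by week and counts the tuples in each group.
    static Map<String, Long> countByWeek(List<String> dates) {
        Map<String, Long> counts = new TreeMap<>();
        for (String date : dates) {
            counts.merge(isoWeek(date), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> dates = List.of("2012-01-01", "2012-01-02", "2012-01-08", "2012-01-09");
        System.out.println(countByWeek(dates));
    }
}
```

Note that ISO weeks can straddle a year boundary, so a date in early January may be counted in the last week of the previous year.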