BigData using Hadoop and Pig

                Sudar Muthu
             Research Engineer
                Yahoo Labs
          http://sudarmuthu.com
      http://twitter.com/sudarmuthu
Who am I?
   Research Engineer at Yahoo Labs
   Mines useful information from huge datasets
   Worked on both structured and unstructured data
   Builds robots as a hobby ;)
What will we see today?
   What is BigData?
   Get our hands dirty with Hadoop
   See some code
   Try out Pig
   Glimpse of HBase and Hive
What is BigData?
“Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.”

        http://en.wikipedia.org/wiki/Big_data
How big is BigData?
1GB today is not the same as 1GB just 10 years ago
Anything that doesn’t fit into the RAM of a single machine
Types of Big Data
Data in movement (streams)
   Twitter/Facebook comments
   Stock market data
   Access logs of a busy web server
   Sensors: vital signs of a newborn
Data at rest (oceans)
   Collection of what has streamed
   Emails or IM messages
   Social Media
   Unstructured documents: forms, claims
We have all this data and need to find a way to process it
Traditional way of scaling (Scaling up)
   Make the machine more powerful
     Add more RAM
     Add more cores to CPU

   It is going to be very expensive
   Will be limited by disk seek and read time
   Single point of failure
New way to scale (Scale out)
   Add more instances of the same machine
   Cost is less compared to scaling up
   Immune to failure of a single or a set of nodes
   Disk seek and write time is not going to be the bottleneck
   Future-safe (to some extent)
Is it a fit for ALL types of problems?
Divide and conquer
Hadoop
A scalable, fault-tolerant grid operating system for data storage and processing
What is Hadoop?
   Runs on Commodity hardware
   HDFS: Fault-tolerant high-bandwidth clustered
    storage
   MapReduce: Distributed data processing
   Works with structured and unstructured data
   Open source, Apache license
   Master (NameNode) – slave architecture
Design Principles
   System shall manage and heal itself
   Performance shall scale linearly
   Algorithms should move to the data
       Lower latency, lower bandwidth
   Simple core, modular and extensible
Components of Hadoop
   HDFS
   MapReduce
   Pig
   HBase
   Hive
Getting started with Hadoop
What am I not going to cover?
   Installation or setting up Hadoop
       Will be running all the code on a single-node instance
   Monitoring of the clusters
   Performance tuning
   User authentication or quotas
Before we get into code, let’s understand some concepts
MapReduce
Framework for distributed processing of large datasets
MapReduce
Consists of two functions
   Map
       Filters and transforms the input into a form the reducer can understand
   Reduce
       Aggregates over the input provided by the Map function
Formal definition
Map
<k1, v1> -> list(<k2, v2>)

Reduce
<k2, list(v2)> -> list(<k3, v3>)
Let’s see some examples
Count number of words in files
Map
<file_name, file_contents> => list(<word, count>)

Reduce
<word, list(count)> => <word, sum_of_counts>
Count number of words in files
Map
<“file1”, “to be or not to be”> =>
{<“to”,1>,
<“be”,1>,
<“or”,1>,
<“not”,1>,
<“to”,1>,
<“be”,1>}
Count number of words in files
Reduce
{<“to”,<1,1>>, <“be”,<1,1>>, <“or”,<1>>,
<“not”,<1>>}

=>

{<“to”,2>, <“be”,2>, <“or”,1>, <“not”,1>}
Max temperature in a year
Map
<file_name, file_contents> => <year, temp>

Reduce
<year, list(temp)> => <year, max_temp>
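
A minimal sketch of the corresponding mapper and reducer, assuming each input line holds a year and a temperature reading separated by whitespace (the record format and class names are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// In MaxTempMapper.java
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumed record format: "<year> <temp>", e.g. "1990 42"
    String[] parts = value.toString().trim().split("\\s+");
    context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
  }
}

// In MaxTempReducer.java
public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Keep the largest temperature seen for this year
    int max = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      max = Math.max(max, value.get());
    }
    context.write(key, new IntWritable(max));
  }
}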
HDFS
HDFS
   Distributed file system
   Data is distributed over different nodes
   Data is replicated for failover
   Storage is abstracted away from the algorithms
HDFS Commands
HDFS Commands
   hadoop fs -mkdir <dir_name>
   hadoop fs -ls <dir_name>
   hadoop fs -rmr <dir_name>
   hadoop fs -put <local_file> <remote_dir>
   hadoop fs -get <remote_file> <local_dir>
   hadoop fs -cat <remote_file>
   hadoop fs -help
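
The same operations are also available programmatically through Hadoop's FileSystem Java API; a minimal sketch (the paths are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from the Hadoop configuration on the classpath
    FileSystem fs = FileSystem.get(new Configuration());

    fs.mkdirs(new Path("input"));                                    // hadoop fs -mkdir input
    fs.copyFromLocalFile(new Path("words.txt"), new Path("input"));  // hadoop fs -put
    for (FileStatus status : fs.listStatus(new Path("input"))) {     // hadoop fs -ls input
      System.out.println(status.getPath());
    }
  }
}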
Let’s write some code
Count Words Demo
   Create a mapper class
       Override map() method
   Create a reducer class
       Override reduce() method
   Create a main method
   Create JAR
   Run it on Hadoop
Map Method
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

  // Break the line into individual words
  String line = value.toString();
  StringTokenizer itr = new StringTokenizer(line);

  // Emit <word, 1> for every word in the line
  while (itr.hasMoreTokens()) {
    context.write(new Text(itr.nextToken()), new IntWritable(1));
  }
}
Reduce Method
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {

  // Sum up the 1s emitted by the mappers for this word
  int sum = 0;
  for (IntWritable value : values) {
    sum += value.get();
  }
  context.write(key, new IntWritable(sum));
}
Main Method
Job job = new Job();
job.setJarByClass(CountWords.class);
job.setJobName("Count Words");

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(CountWordsMapper.class);

job.setReducerClass(CountWordsReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Submit the job and wait for it to finish
System.exit(job.waitForCompletion(true) ? 0 : 1);
Run it on Hadoop


hadoop jar dist/countwords.jar com.sudarmuthu.hadoop.countwords.CountWords input/ output/
Output
at          1
be          3
can         7
can't       1
code        2
command     1
connect     1
consider    1
continued   1
control     4
could       1
couple      1
courtesy    1
desktop,    1
detailed    1
details     1
…..
…..
Pig
What is Pig?
Pig provides an abstraction for processing large
datasets

Consists of
   Pig Latin – a language to express data flows
   An execution environment to run Pig Latin programs
Why do we need Pig?
   MapReduce can get complex if your data needs a lot of processing/transformations
   MapReduce provides primitive data structures
   Pig provides rich data structures
   Supports complex operations like joins
Running Pig programs
   In an interactive shell called Grunt
   As a Pig script
   Embedded into Java programs (like JDBC) – see the sketch below
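
For the embedded option, Pig provides the PigServer class; a minimal sketch, assuming local mode and a made-up input file:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
  public static void main(String[] args) throws Exception {
    // Local mode; use ExecType.MAPREDUCE to run against a Hadoop cluster
    PigServer pig = new PigServer(ExecType.LOCAL);
    pig.registerQuery("lines = LOAD 'input/words.txt' AS (word:chararray);");
    pig.store("lines", "output/words_copy"); // triggers execution and stores the result
  }
}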
Grunt – Interactive Shell
Grunt shell
   fs commands – like hadoop fs
     fs -ls
     fs -mkdir
     fs -copyToLocal <remote_file> <local_dir>
     fs -copyFromLocal <local_file> <remote_dir>

   exec – execute Pig scripts
   sh – execute shell commands
Let’s see them in action
Pig Latin
   LOAD – Read files
   DUMP – Dump data to the console
   JOIN – Do a join on data sets
   FILTER – Filter data sets
   ORDER – Sort data
   STORE – Store data back in files
Let’s see some code
Sort words based on count
Filter words present in a list
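
The source of these two demos is not in the deck; below is a hypothetical Pig Latin version of both, embedded via PigServer (relation names, file names, and the word-list format are assumptions):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDemos {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Load the <word, count> pairs produced by the CountWords job
    pig.registerQuery("counts = LOAD 'output' AS (word:chararray, count:int);");

    // Demo 1: sort words based on count
    pig.registerQuery("sorted = ORDER counts BY count DESC;");
    pig.store("sorted", "sorted_words");

    // Demo 2: keep only the words present in a given list,
    // by joining against a one-column word list
    pig.registerQuery("wanted = LOAD 'word_list.txt' AS (word:chararray);");
    pig.registerQuery("joined = JOIN counts BY word, wanted BY word;");
    pig.registerQuery("filtered = FOREACH joined GENERATE counts::word, counts::count;");
    pig.store("filtered", "filtered_words");
  }
}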
HBase
What is HBase?
   Distributed, column-oriented database built on
    top of HDFS
   Useful when real-time read/write random-access
    to very large datasets is needed.
   Can handle billions of rows with millions of
    columns
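
A taste of the HBase Java client API of that vintage; a minimal sketch, assuming a running HBase with a 'users' table that has an 'info' column family:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "users");

    // Write one cell: row "row1", column family "info", qualifier "name"
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Sudar"));
    table.put(put);

    // Random-access read of the same row
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

    table.close();
  }
}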
Hive
What is Hive?
   Useful for managing and querying structured
    data
   Provides an SQL-like syntax
   Metadata is stored in an RDBMS
   Extensible with types, functions, scripts, etc.
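
Because Hive exposes a JDBC driver, querying it from Java looks like plain JDBC; a minimal sketch, assuming a Hive server on localhost:10000 and a hypothetical word_counts(word, cnt) table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");

    Statement stmt = con.createStatement();
    // HiveQL looks just like SQL
    ResultSet rs = stmt.executeQuery("SELECT word, cnt FROM word_counts ORDER BY cnt DESC");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
    }
    con.close();
  }
}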
Hadoop                               Relational Databases
   Affordable storage/compute           Interactive response times
   Structured or unstructured data      ACID
   Resilient, auto scalability          Structured data
                                        Cost/scale prohibitive
Thank You
