BigData using Hadoop and Pig

                Sudar Muthu
             Research Engineer
                Yahoo Labs
          http://sudarmuthu.com
      http://twitter.com/sudarmuthu
Who am I?
   Research Engineer at Yahoo Labs
   Mines useful information from huge datasets
   Worked on both structured and unstructured data
   Builds robots as a hobby ;)
What will we see today?
   What is BigData?
   Get our hands dirty with Hadoop
   See some code
   Try out Pig
   Glimpse of HBase and Hive
What is BigData?
“Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.”

        http://en.wikipedia.org/wiki/Big_data
How big is BigData?
1GB today is not the same as 1GB just 10 years ago
Anything that doesn’t fit into the RAM of a single machine
Types of Big Data
Data in movement (streams)
   Twitter/Facebook comments
   Stock market data
   Access logs of a busy web server
   Sensors: vital signs of a newborn
Data at rest (oceans)
   Collection of what has streamed
   Emails or IM messages
   Social Media
   Unstructured documents: forms, claims
We have all this data and need to find a way to process it
Traditional way of scaling (Scaling up)
   Make the machine more powerful
     Add more RAM
     Add more cores to CPU

   It is going to be very expensive
   Will be limited by disk seek and read time
   Single point of failure
New way to scale (Scale out)
   Add more instances of the same machine
   Cost is less compared to scaling up
   Immune to failure of a single or a set of nodes
   Disk seek and write time is not going to be the bottleneck
   Future-safe (to some extent)
Is it a fit for ALL types of problems?
Divide and conquer
Hadoop
A scalable, fault-tolerant grid operating system for data storage and processing
What is Hadoop?
   Runs on Commodity hardware
   HDFS: Fault-tolerant high-bandwidth clustered
    storage
   MapReduce: Distributed data processing
   Works with structured and unstructured data
   Open source, Apache license
   Master (NameNode) – slave architecture
Design Principles
   System shall manage and heal itself
   Performance shall scale linearly
   Algorithms should move to the data
       Lower latency, lower bandwidth
   Simple core, modular and extensible
Components of Hadoop
   HDFS
   MapReduce
   Pig
   HBase
   Hive
Getting started with Hadoop
What am I not going to cover?
   Installation or setting up Hadoop
       Will be running all the code on a single-node instance
   Monitoring of the clusters
   Performance tuning
   User authentication or quotas
Before we get into code, let’s understand some concepts
MapReduce
Framework for distributed processing of large datasets
MapReduce
Consists of two functions
   Map
       Filters and transforms the input into a form the reducer can understand
   Reduce
       Aggregates over the input provided by the Map function
Formal definition
Map
<k1, v1> -> list(<k2, v2>)

Reduce
<k2, list(v2)> -> list(<k3, v3>)
Let’s see some examples
Count number of words in files
Map
<file_name, file_contents> => list(<word, count>)

Reduce
<word, list(count)> => <word, sum_of_counts>
Count number of words in files
Map
<“file1”, “to be or not to be”> =>
{<“to”,1>,
<“be”,1>,
<“or”,1>,
<“not”,1>,
<“to”,1>,
<“be”,1>}
Count number of words in files
Reduce
{<“to”,<1,1>>, <“be”,<1,1>>, <“or”,<1>>,
<“not”,<1>>}

=>

{<“to”,2>, <“be”,2>, <“or”,1>, <“not”,1>}
Max temperature in a year
Map
<file_name, file_contents> => <year, temp>

Reduce
<year, list(temp)> => <year, max_temp>
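
A minimal sketch of the corresponding mapper and reducer, assuming each input line holds a year and a temperature reading separated by whitespace (the record format and class names are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// In MaxTempMapper.java
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Assumed record format: "<year> <temp>", e.g. "1990 42"
    String[] parts = value.toString().trim().split("\\s+");
    context.write(new Text(parts[0]), new IntWritable(Integer.parseInt(parts[1])));
  }
}

// In MaxTempReducer.java
public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Keep the largest temperature seen for this year
    int max = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      max = Math.max(max, value.get());
    }
    context.write(key, new IntWritable(max));
  }
}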
HDFS
HDFS
   Distributed file system
   Data is distributed over different nodes
   Data is replicated for failover
   Storage is abstracted away from the algorithms
HDFS Commands
HDFS Commands
   hadoop fs -mkdir <dir_name>
   hadoop fs -ls <dir_name>
   hadoop fs -rmr <dir_name>
   hadoop fs -put <local_file> <remote_dir>
   hadoop fs -get <remote_file> <local_dir>
   hadoop fs -cat <remote_file>
   hadoop fs -help
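
The same operations are also available programmatically through Hadoop's FileSystem Java API; a minimal sketch (the paths are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name from the Hadoop configuration on the classpath
    FileSystem fs = FileSystem.get(new Configuration());

    fs.mkdirs(new Path("input"));                                    // hadoop fs -mkdir input
    fs.copyFromLocalFile(new Path("words.txt"), new Path("input"));  // hadoop fs -put
    for (FileStatus status : fs.listStatus(new Path("input"))) {     // hadoop fs -ls input
      System.out.println(status.getPath());
    }
  }
}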
Let’s write some code
Count Words Demo
   Create a mapper class
       Override map() method
   Create a reducer class
       Override reduce() method
   Create a main method
   Create JAR
   Run it on Hadoop
Map Method
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {

  // Break the line into individual words
  String line = value.toString();
  StringTokenizer itr = new StringTokenizer(line);

  // Emit <word, 1> for every word in the line
  while (itr.hasMoreTokens()) {
    context.write(new Text(itr.nextToken()), new IntWritable(1));
  }
}
Reduce Method
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {

  // Sum up the 1s emitted by the mappers for this word
  int sum = 0;
  for (IntWritable value : values) {
    sum += value.get();
  }
  context.write(key, new IntWritable(sum));
}
Main Method
Job job = new Job();
job.setJarByClass(CountWords.class);
job.setJobName("Count Words");

FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.setMapperClass(CountWordsMapper.class);

job.setReducerClass(CountWordsReducer.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

// Submit the job and wait for it to finish
System.exit(job.waitForCompletion(true) ? 0 : 1);
Run it on Hadoop


hadoop jar dist/countwords.jar com.sudarmuthu.hadoop.countwords.CountWords input/ output/
Output
at          1
be          3
can         7
can't       1
code        2
command     1
connect     1
consider    1
continued   1
control     4
could       1
couple      1
courtesy    1
desktop,    1
detailed    1
details     1
…..
…..
Pig
What is Pig?
Pig provides an abstraction for processing large
datasets

Consists of
   Pig Latin – a language to express data flows
   An execution environment to run Pig Latin programs
Why do we need Pig?
   MapReduce can get complex if your data needs a lot of processing/transformations
   MapReduce provides primitive data structures
   Pig provides rich data structures
   Supports complex operations like joins
Running Pig programs
   In an interactive shell called Grunt
   As a Pig script
   Embedded into Java programs (like JDBC) – see the sketch below
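
For the embedded option, Pig provides the PigServer class; a minimal sketch, assuming local mode and a made-up input file:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPig {
  public static void main(String[] args) throws Exception {
    // Local mode; use ExecType.MAPREDUCE to run against a Hadoop cluster
    PigServer pig = new PigServer(ExecType.LOCAL);
    pig.registerQuery("lines = LOAD 'input/words.txt' AS (word:chararray);");
    pig.store("lines", "output/words_copy"); // triggers execution and stores the result
  }
}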
Grunt – Interactive Shell
Grunt shell
   fs commands – like hadoop fs
     fs -ls
     fs -mkdir
     fs -copyToLocal <remote_file> <local_dir>
     fs -copyFromLocal <local_file> <remote_dir>

   exec – execute Pig scripts
   sh – execute shell commands
Let’s see them in action
Pig Latin
   LOAD – Read files
   DUMP – Dump data to the console
   JOIN – Do a join on data sets
   FILTER – Filter data sets
   ORDER – Sort data
   STORE – Store data back in files
Let’s see some code
Sort words based on count
Filter words present in a list
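
The source of these two demos is not in the deck; below is a hypothetical Pig Latin version of both, embedded via PigServer (relation names, file names, and the word-list format are assumptions):

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigDemos {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Load the <word, count> pairs produced by the CountWords job
    pig.registerQuery("counts = LOAD 'output' AS (word:chararray, count:int);");

    // Demo 1: sort words based on count
    pig.registerQuery("sorted = ORDER counts BY count DESC;");
    pig.store("sorted", "sorted_words");

    // Demo 2: keep only the words present in a given list,
    // by joining against a one-column word list
    pig.registerQuery("wanted = LOAD 'word_list.txt' AS (word:chararray);");
    pig.registerQuery("joined = JOIN counts BY word, wanted BY word;");
    pig.registerQuery("filtered = FOREACH joined GENERATE counts::word, counts::count;");
    pig.store("filtered", "filtered_words");
  }
}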
HBase
What is HBase?
   Distributed, column-oriented database built on
    top of HDFS
   Useful when real-time read/write random-access
    to very large datasets is needed.
   Can handle billions of rows with millions of
    columns
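
A taste of the HBase Java client API of that vintage; a minimal sketch, assuming a running HBase with a 'users' table that has an 'info' column family:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "users");

    // Write one cell: row "row1", column family "info", qualifier "name"
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Sudar"));
    table.put(put);

    // Random-access read of the same row
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));

    table.close();
  }
}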
Hive
What is Hive?
   Useful for managing and querying structured
    data
   Provides an SQL-like syntax
   Metadata is stored in an RDBMS
   Extensible with types, functions, scripts, etc.
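
Because Hive exposes a JDBC driver, querying it from Java looks like plain JDBC; a minimal sketch, assuming a Hive server on localhost:10000 and a hypothetical word_counts(word, cnt) table:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");

    Statement stmt = con.createStatement();
    // HiveQL looks just like SQL
    ResultSet rs = stmt.executeQuery("SELECT word, cnt FROM word_counts ORDER BY cnt DESC");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
    }
    con.close();
  }
}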
Hadoop                               Relational Databases
   Affordable storage/compute           Interactive response times
   Structured or unstructured data      ACID
   Resilient, auto scalability          Structured data
                                        Cost/scale prohibitive
Thank You
