1. BigData using Hadoop and
Pig
Sudar Muthu
Research Engineer
Yahoo Labs
http://sudarmuthu.com
http://twitter.com/sudarmuthu
2. Who am I?
Research Engineer at Yahoo Labs
Mines useful information from huge datasets
Worked on both structured and unstructured
data.
Builds robots as a hobby ;)
3. What will we see today?
What is BigData?
Get our hands dirty with Hadoop
See some code
Try out Pig
Glimpse of HBase and Hive
5. “Big data is a collection of data sets so large
and complex that it becomes difficult to
process using on-hand database
management tools.”
http://en.wikipedia.org/wiki/Big_data
10. Data in Movement (streams)
Twitter/Facebook comments
Stock market data
Access logs of a busy web server
Sensors: vital signs of a newborn
11. Data at rest (Oceans)
Collection of what has streamed
Emails or IM messages
Social Media
Unstructured documents: forms, claims
12. We have all this data and
need to find a way to
process it
13. Traditional way of scaling
(Scaling up)
Make the machine more powerful
Add more RAM
Add more cores to CPU
It is going to be very expensive
Will be limited by disk seek and read time
Single point of failure
14. New way of scaling (Scaling out)
Add more instances of the same machine
Cost is less compared to scaling up
Immune to failure of a single or a set of nodes
Disk seek and write time is not going to be a
bottleneck
Future-safe (to some extent)
19. What is Hadoop?
Runs on Commodity hardware
HDFS: Fault-tolerant high-bandwidth clustered
storage
MapReduce: Distributed data processing
Works with structured and unstructured data
Open source, Apache license
Master (NameNode) – slave architecture
20. Design Principles
System shall manage and heal itself
Performance shall scale linearly
Algorithm should move to data
Lower latency, lower bandwidth
Simple core, modular and extensible
23. What am I not going to cover?
Installation or setting up Hadoop
Will be running all the code in a single node instance
Monitoring of the clusters
Performance tuning
User authentication or quota
24. Before we get into code,
let’s understand some
concepts
27. MapReduce
Consists of two functions
Map
Filters and transforms the input into a form the
reducer can understand
Reduce
Aggregates over the input provided by the Map
function
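The two phases above can be sketched in plain Java, with no Hadoop involved; the class and method names here are invented for illustration and are not Hadoop API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A toy sketch of the MapReduce idea: the "map" phase emits
// (word, 1) pairs, and the "reduce" phase sums them per word.
public class WordCountSketch {

    // "Map": filter/transform each input line into (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .filter(w -> !w.isEmpty())
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // "Reduce": aggregate all the counts emitted for the same word.
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(map("to be or not to be"));
        System.out.println("to -> " + counts.get("to")); // prints "to -> 2"
    }
}
```

In real Hadoop the framework shuffles and groups the emitted pairs between the two phases across many nodes; here that grouping happens inside `reduce` for clarity.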
35. HDFS
Distributed file system
Data is distributed over different nodes
Data is replicated for failover
The distribution is abstracted away from the algorithms
41. Count Words Demo
Create a mapper class
Override map() method
Create a reducer class
Override reduce() method
Create a main method
Create JAR
Run it on Hadoop
42. Map Method
public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String line = value.toString();
  StringTokenizer itr = new StringTokenizer(line);
  while (itr.hasMoreTokens()) {
    // Emit (word, 1) for every token in the line
    context.write(new Text(itr.nextToken()), new IntWritable(1));
  }
}
43. Reduce Method
public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  // Sum up all the counts emitted for this word
  for (IntWritable value : values) {
    sum += value.get();
  }
  context.write(key, new IntWritable(sum));
}
44. Main Method
Job job = new Job();
job.setJarByClass(CountWords.class);
job.setJobName("Count Words");
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(CountWordsMapper.class);
job.setReducerClass(CountWordsReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Submit the job and wait for it to finish
System.exit(job.waitForCompletion(true) ? 0 : 1);
45. Run it on Hadoop
hadoop jar dist/countwords.jar \
  com.sudarmuthu.hadoop.countwords.CountWords input/ output/
46. Output
at 1
be 3
can 7
can't 1
code 2
command 1
connect 1
consider 1
continued 1
control 4
could 1
couple 1
courtesy 1
desktop, 1
detailed 1
details 1
…..
…..
48. What is Pig?
Pig provides an abstraction for processing large
datasets
Consists of
Pig Latin – Language to express data flows
Execution environment
49. Why do we need Pig?
MapReduce can get complex if your data needs a
lot of processing/transformations
MapReduce provides primitive data structures
Pig provides rich data structures
Supports complex operations like joins
50. Running Pig programs
In an interactive shell called Grunt
As a Pig Script
Embedded into Java programs (like JDBC)
54. Pig Latin
LOAD – Read files
DUMP – Dump data in the console
JOIN – Do a join on data sets
FILTER – Filter data sets
ORDER – Sort data
STORE – Store data back in files
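As a rough analogy in plain Java (not Pig itself), the sketch below shows what a LOAD → FILTER → JOIN → DUMP data flow does conceptually; all class names, record fields, and data are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Plain-Java analogy of a Pig data flow: load two data sets,
// filter one, join them by key, and dump the result.
public class PigFlowSketch {
    public record User(int id, String name) {}
    public record Order(int uid, String item, int amount) {}

    public static List<String> flow(List<User> users, List<Order> orders) {
        // FILTER: keep only orders above a threshold
        List<Order> big = orders.stream()
                .filter(o -> o.amount() > 10)
                .collect(Collectors.toList());

        // JOIN: match each remaining order to its user's name by id
        Map<Integer, String> names = users.stream()
                .collect(Collectors.toMap(User::id, User::name));
        return big.stream()
                .map(o -> names.get(o.uid()) + "\t" + o.item())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // LOAD: in Pig these would be read from files on HDFS
        List<User> users = List.of(new User(1, "alice"), new User(2, "bob"));
        List<Order> orders = List.of(
                new Order(1, "book", 20), new Order(2, "pen", 2), new Order(1, "lamp", 35));

        // DUMP: in Pig this would print the relation to the console
        flow(users, orders).forEach(System.out::println);
    }
}
```

In Pig Latin the same flow is a few declarative lines, and the execution environment compiles it into MapReduce jobs; the point of the sketch is only to show which relational steps the operators express.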
59. What is HBase?
Distributed, column-oriented database built on
top of HDFS
Useful when real-time read/write random-access
to very large datasets is needed.
Can handle billions of rows with millions of
columns
61. What is Hive?
Useful for managing and querying structured
data
Provides a SQL-like syntax
Metadata is stored in an RDBMS
Extensible with types, functions, scripts, etc.
62. Hadoop vs. Relational Databases
Hadoop:
Affordable storage/compute
Structured or unstructured data
Resilient, auto scalability
Relational Databases:
Interactive response times
ACID
Structured data
Cost/scale prohibitive