This document discusses using Apache Pig as a tool for researchers to analyze large datasets. It notes that researchers are motivated by their subject area, but may lack skills in tools like MapReduce. Pig provides a simpler language called Pig Latin to process data in Hadoop without needing MapReduce skills. The document provides an overview of Pig, examples of Pig Latin code, and tips for using Pig including taking samples of data for testing. It recommends Pig as a way for researchers to utilize Hadoop for dataset analysis without deep technical skills.
1. Apache Pig as a researcher's stepping stone
Ben O'Steen @benosteen
ben.osteen@bl.uk
2. www.bl.uk 2
Motivation:
• (Anecdotally) Researchers are motivated by their subject.
– Tools and techniques are interesting to them if they can help further their knowledge and mastery of their chosen field.
5. My Problem:
• We have a lot of data.
– More than will fit on researchers' workstations, but not what HPC people consider Big Data™.
• Different problem to typical HPC:
– Ours: Small compute over a series of large, messy datasets
– HPC: Large compute over “small”, well-characterised input datasets
• What's the minimum a researcher needs to learn in order to make use of compute clouds?
6. What choices are there?
• Excel, while ubiquitous, has limitations, especially when dealing with semi-structured data.
• OpenRefine is a fine choice, but has its own pros and cons.
• General-purpose computing environment
– I'm biased, but this is a great choice, though not an easy sell to task-focussed people.
• Tailored computing environment
– R, SciPy, MATLAB, and so on.
7. What about Hadoop?
• Industry backing and use.
• Open and subscription-free.
• Write once, run on any cluster
– Well, mostly.
• Clusters can be 'spun up' on demand from a number of providers (e.g. AWS, Azure)
• Lovely. But…
8. Researchers and distributed computing
• The idea of trying to teach MapReduce or related techniques to a task-focussed researcher doesn't appeal.
9. Researchers and distributed computing
• The idea of trying to teach how to do MapReduce in Java to a task-focussed researcher doesn't appeal at all.
10. Hiding Hadoop
• Large number of projects built on top of Hadoop
– Using the Hadoop framework, but presenting a different way to utilise it
• HBase, Mahout, Hive and, of course, Pig
11. Why Pig?
• From the wiki:
“Apache Pig is a platform for analyzing large data sets. Pig's
language, Pig Latin, lets you specify a sequence of data
transformations such as merging data sets, filtering them,
and applying functions to records or groups of records.
Pig comes with many built-in functions but you can also
create your own user-defined functions to do special-purpose
processing.”
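The three kinds of transformation named in the quote can be sketched in a few lines of Pig Latin. This is only an illustration under stated assumptions: the 'c19/titles' dataset and its fields are hypothetical, and pubdate is assumed to be a plain year string such as 1851.

```pig
-- Assumption: tab-separated input where pubdate is a year string such as '1851'
raw = LOAD 'c19/metadatalist' AS (id:chararray, pubdate:chararray);

-- Filtering: keep only records whose pubdate starts with '18'
c19 = FILTER raw BY pubdate MATCHES '18.*';

-- Applying a built-in function to each record: first three characters of the year
decades = FOREACH c19 GENERATE id, SUBSTRING(pubdate, 0, 3) AS decade_prefix;

-- Merging data sets: 'c19/titles' is a hypothetical second dataset keyed by id
titles = LOAD 'c19/titles' AS (id:chararray, title:chararray);
joined = JOIN decades BY id, titles BY id;

DUMP joined;
```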
13. Pig's Philosophy
• Pigs eat anything
• Pigs live anywhere
• Pigs are domestic animals
• Pigs fly
(from Programming Pig, by Alan Gates)
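14. What does Pig Latin look like?
-- Load the metadata list (tab-separated by default) as two fields
raw = LOAD 'c19/metadatalist' AS (id, pubdate);
-- Keep the fields of interest
dates = FOREACH raw GENERATE id AS id, pubdate AS pubdate;
-- Collect the records that share a publication date
date_group = GROUP dates BY pubdate;
STORE date_group INTO 'c19/date_group';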
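A common next step after a GROUP is to reduce each group to a number. The sketch below counts items per publication date; the chararray types and the output path 'c19/date_counts' are assumptions made for illustration, not part of the original script.

```pig
-- Assumption: both fields are read as strings
raw = LOAD 'c19/metadatalist' AS (id:chararray, pubdate:chararray);

-- After GROUP, each record is (group, {bag of the raw records with that pubdate})
date_group = GROUP raw BY pubdate;

-- COUNT skips records whose first field is null; COUNT_STAR would include them
date_counts = FOREACH date_group GENERATE group AS pubdate, COUNT(raw) AS num_items;

-- 'c19/date_counts' is a hypothetical output path
STORE date_counts INTO 'c19/date_counts';
```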
15. Write once…
• The Pig script couldn't care less whether:
– the dataset is 12 MB or 12 TB
– it is running on a small VM or a huge cluster
– the dataset is only a sample
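One way to keep a single script for both cases is Pig's parameter substitution, sketched below. The file name dates.pig and the sample/ paths are hypothetical; the script body is the same whether it runs on a laptop or a cluster, and only the command line changes.

```pig
-- dates.pig (hypothetical file name)
-- Local test run on a small sample:
--   pig -x local -param input=sample/metadatalist -param output=sample/date_group dates.pig
-- Full run on a Hadoop cluster:
--   pig -x mapreduce -param input=c19/metadatalist -param output=c19/date_group dates.pig
raw = LOAD '$input' AS (id, pubdate);
date_group = GROUP raw BY pubdate;
STORE date_group INTO '$output';
```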
16. Some tips
• Distributed computing's Hello World is a word count
(a.txt is a big list of words, one per line)
a = LOAD 'a.txt';
-- Put every record into one group, then count them
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a) AS num_rows;
-- Print the total number of words
DUMP c;
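The script above gives the total number of words. A per-word tally, which is what "word count" usually means, needs only a different grouping key. A minimal sketch, assuming the same one-word-per-line file; the alias names are illustrative.

```pig
-- Assumption: a.txt holds one word per line
words = LOAD 'a.txt' AS (word:chararray);

-- One group per distinct word
by_word = GROUP words BY word;

-- Number of occurrences of each word
word_counts = FOREACH by_word GENERATE group AS word, COUNT(words) AS occurrences;

DUMP word_counts;
```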
17. Some tips
• “raw_sample = SAMPLE raw 0.01;”
– SAMPLE takes a random sample (here 0.01, or roughly 1%) of some source data ('raw') rather than processing the lot. Great for testing.
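In a script, the sample simply takes the place of the full relation in later steps. A minimal sketch follows: the alias is called raw_sample because, as far as I know, sample is itself a reserved keyword in Pig Latin, and the output path is hypothetical.

```pig
raw = LOAD 'c19/metadatalist' AS (id, pubdate);

-- Keep each record with probability 0.01, so the result is roughly (not exactly) 1% of raw
raw_sample = SAMPLE raw 0.01;

-- Optionally save the sample so later test runs reuse the same records
-- ('c19/metadatalist_sample' is a hypothetical path)
STORE raw_sample INTO 'c19/metadatalist_sample';

-- Develop the rest of the script against the sample, then swap raw back in
date_group = GROUP raw_sample BY pubdate;
DUMP date_group;
```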