Apache Pig as a researcher’s stepping stone

This document discusses using Apache Pig as a tool for researchers to analyze large datasets. It notes that researchers are motivated by their subject area, but may lack skills in tools like MapReduce. Pig provides a simpler language called Pig Latin to process data in Hadoop without needing MapReduce skills. The document provides an overview of Pig, examples of Pig Latin code, and tips for using Pig including taking samples of data for testing. It recommends Pig as a way for researchers to utilize Hadoop for dataset analysis without deep technical skills.

Apache Pig as a researcher’s stepping stone
Ben O’Steen @benosteen
ben.osteen@bl.uk

www.bl.uk 2
Motivation:
• (Anecdotally) Researchers are motivated by their subject.
– Tools and techniques are interesting to them if they can help
further their knowledge and mastery in their chosen field.

My Problem:
• We have a lot of data.
– More than will fit on researchers’ workstations, but not what
HPC people consider Big Data™.
• Different problem to typical HPC:
– Ours: Small compute over a series of large, messy datasets
– HPC: Large compute over “small”, well-characterised input
datasets
• What’s the minimum a researcher needs to learn in order to
make use of compute clouds?

What choices are there?
• Excel, while ubiquitous, has limitations especially when
dealing with semi-structured data.
• OpenRefine is a fine choice, but has its own pros and cons.
• General-purpose computing environment
– I’m biased: this is a great choice, but not an easy sell to
task-focussed people.
• Tailored computing environment
– R, SciPy, MATLAB, and so on.

What about Hadoop?
• Industry backing and use.
• Open and subscription-free.
• Write once, run on any cluster
– Well, mostly.
• Clusters can be ‘spun up’ on demand from a number of
providers (e.g. AWS, Azure)
• Lovely. But…

Researchers and distributed computing
• The idea of trying to teach Map-Reduce or related
techniques to a task-focussed researcher doesn’t appeal.

Researchers and distributed computing
• The idea of trying to teach how to do Map-Reduce in Java
to a task-focussed researcher doesn’t appeal at all.

Hiding Hadoop
• Large number of projects built on top of Hadoop
– Using the Hadoop framework, but presenting a different way
to utilise it
• HBase, Mahout, Hive and, of course, Pig

Why Pig?
• From the wiki:
“Apache Pig is a platform for analyzing large data sets. Pig's
language, Pig Latin, lets you specify a sequence of data
transformations such as merging data sets, filtering them,
and applying functions to records or groups of records.
Pig comes with many built-in functions but you can also
create your own user-defined functions to do special-purpose
processing.”
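
The three kinds of transformation named in the quote (merging, filtering, applying functions) might look like the sketch below. It is an illustration rather than anything from the original slides: the second dataset 'c19/titles', its title field, the output path and the aliases c19, joined and result are all assumptions, and both inputs are assumed to be tab-delimited text.

```pig
-- Sketch only: 'c19/titles' and its schema are hypothetical.
raw    = LOAD 'c19/metadatalist' AS (id:chararray, pubdate:chararray);
titles = LOAD 'c19/titles' AS (id:chararray, title:chararray);

-- Filtering: keep records whose pubdate starts with '18'
-- ('matches' tests the whole string against a Java regular expression).
c19 = FILTER raw BY pubdate matches '18.*';

-- Merging: join the two datasets on their shared id field.
joined = JOIN c19 BY id, titles BY id;

-- Applying a built-in function to each record.
result = FOREACH joined GENERATE c19::id AS id,
                                 UPPER(titles::title) AS title,
                                 c19::pubdate AS pubdate;

STORE result INTO 'c19/titles_1800s';  -- hypothetical output path
```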

Pig’s Philosophy
• Pigs eat anything
• Pigs live anywhere
• Pigs are domestic animals
• Pigs fly
(from Programming Pig, by Alan Gates)

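What does Pig Latin look like?
-- Load (id, pubdate) records; Pig's default loader expects tab-delimited text
raw = LOAD 'c19/metadatalist' AS (id, pubdate);
dates = FOREACH raw GENERATE id AS id, pubdate AS pubdate;
-- Collect all records sharing a publication date into one group
date_group = GROUP dates BY pubdate;
STORE date_group INTO 'c19/date_group';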
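
As a hedged extension of the script above (not part of the original slides), each group can be reduced to a single count per publication date. The aliases date_counts and n_items and the output path 'c19/date_counts' are made up for illustration, and the input is assumed to be tab-delimited text.

```pig
-- Assumes 'c19/metadatalist' is tab-delimited text (Pig's default loader format).
raw = LOAD 'c19/metadatalist' AS (id:chararray, pubdate:chararray);
date_group = GROUP raw BY pubdate;

-- Each group carries a bag named after the grouped relation ('raw');
-- COUNT returns the number of records in that bag.
date_counts = FOREACH date_group GENERATE group AS pubdate, COUNT(raw) AS n_items;

STORE date_counts INTO 'c19/date_counts';  -- hypothetical output path
```

The result is one row per publication date, which is small enough to open in a spreadsheet even when the input is not.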

Write once…
• The Pig script couldn’t care less whether:
– the dataset is 12 MB or 12 TB
– it is running on a small VM or a huge cluster
– the dataset is only a sample
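
One way to exploit this, sketched under assumptions: keep the input path out of the script with Pig's parameter substitution, then choose the execution mode on the command line. The file name date_counts.pig, the parameter name input and the sample path are hypothetical.

```pig
-- Hypothetical file: date_counts.pig
-- Local run on a small sample:
--   pig -x local -param input=sample/metadatalist date_counts.pig
-- Cluster run on the full dataset:
--   pig -x mapreduce -param input=c19/metadatalist date_counts.pig

-- Fallback value used when no -param is given on the command line.
%default input 'c19/metadatalist'

raw = LOAD '$input' AS (id:chararray, pubdate:chararray);
date_group = GROUP raw BY pubdate;
date_counts = FOREACH date_group GENERATE group AS pubdate, COUNT(raw) AS n_items;
DUMP date_counts;
```

The script body is identical in both runs; only the command line changes.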

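Some tips
• Distributed computing’s Hello World is a word count
(a.txt is a big list of words, one per line)
a = LOAD 'a.txt';
b = GROUP a ALL;  -- put every row into a single group
c = FOREACH b GENERATE COUNT(a) AS num_rows;  -- total number of rows
DUMP c;  -- Pig runs nothing until output is requested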
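
The script above counts every row in the file. A per-word tally, the other common form of the word-count exercise, could be sketched as follows. The aliases words, by_word and word_counts are illustrative, and a.txt is assumed to hold one word per line as on the slide.

```pig
-- Assumes a.txt holds one word per line.
words = LOAD 'a.txt' AS (word:chararray);

-- One group per distinct word.
by_word = GROUP words BY word;

-- Emit each word with the number of times it occurs.
word_counts = FOREACH by_word GENERATE group AS word, COUNT(words) AS n;

DUMP word_counts;
```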

Some tips
• “raw_sample = SAMPLE raw 0.01;”
– Operator that takes a random sample (0.01, i.e. roughly 1%) of
some source data (‘raw’), rather than processing the lot. Great
for testing.
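
A minimal sketch of the tip in context, assuming the same tab-delimited 'c19/metadatalist' input used earlier. The alias raw_sample and the downstream aliases are made up for illustration.

```pig
raw = LOAD 'c19/metadatalist' AS (id:chararray, pubdate:chararray);

-- SAMPLE keeps each row with probability 0.01, so the sample size is
-- approximate and differs from run to run.
raw_sample = SAMPLE raw 0.01;

-- Develop and debug against the sample...
date_group = GROUP raw_sample BY pubdate;
date_counts = FOREACH date_group GENERATE group AS pubdate, COUNT(raw_sample) AS n_items;
DUMP date_counts;

-- ...then replace raw_sample with raw above to process the full dataset.
```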

BNB and C19thC scripts
• See https://github.com/bl-labs

