Apache Pig as a researcher’s stepping stone
Ben O’Steen @benosteen
ben.osteen@bl.uk
www.bl.uk 2
Motivation:
• (Anecdotally) Researchers are motivated by their subject.
– Tools and techniques are interesting to them if they can help
further their knowledge and mastery of their chosen field.
My Problem:
• We have a lot of data.
– More than will fit on a researcher’s workstation, but not what
HPC people consider Big Data™.
• Different problem to typical HPC:
– Ours: Small compute over a series of large, messy datasets
– HPC: Large compute over “small”, well-characterised input
datasets
• What’s the minimum a researcher needs to learn in order to
make use of compute clouds?
What choices are there?
• Excel, while ubiquitous, has limitations especially when
dealing with semi-structured data.
• OpenRefine is a fine choice, but has its own pros and cons.
• General purpose computing environment
– I’m biased, but this is a great choice, though not an easy sell to
task-focussed people.
• Tailored computing environment
– R, SciPy, MATLAB, and so on.
What about Hadoop?
• Industry backing and use.
• Open and subscription-free.
• Write once, run on any cluster
– Well, mostly.
• Clusters can be ‘spun up’ on demand from a number of
providers (e.g. AWS, Azure)
• Lovely. But…
Researchers and distributed computing
• The idea of trying to teach Map-Reduce or related
techniques to a task-focussed researcher doesn’t appeal.
Researchers and distributed computing
• The idea of trying to teach how to do Map-Reduce in Java
to a task-focussed researcher doesn’t appeal at all.
Hiding Hadoop
• Large number of projects built on top of Hadoop
– Using the Hadoop framework, but presenting a different way
to utilise it
• HBase, Mahout, Hive, and of course, Pig
Why Pig?
• From the wiki:
“Apache Pig is a platform for analyzing large data sets. Pig's
language, Pig Latin, lets you specify a sequence of data
transformations such as merging data sets, filtering them,
and applying functions to records or groups of records.
Pig comes with many built-in functions but you can also
create your own user-defined functions to do special-purpose
processing.”
Pig’s Philosophy
• Pigs eat anything
• Pigs live anywhere
• Pigs are domestic animals
• Pigs fly
(from Programming Pig, by Alan Gates)
What does Pig Latin look like?
raw = LOAD 'c19/metadatalist' AS (id, pubdate);
dates = FOREACH raw GENERATE id AS id, pubdate AS pubdate;
date_group = GROUP dates BY pubdate;
STORE date_group INTO 'c19/date_group';
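For readers who have not met GROUP before, here is a rough sketch in plain Python (not Pig Latin) of what the grouping step in the script above produces: one entry per publication date, each holding a "bag" of the whole rows that share it. The sample rows and the helper name group_by_pubdate are made up for illustration; they are not taken from the real 'c19/metadatalist' data.

```python
from collections import defaultdict

# Hypothetical rows standing in for 'c19/metadatalist': (id, pubdate)
raw = [("b001", "1850"), ("b002", "1851"), ("b003", "1850")]


def group_by_pubdate(rows):
    """Mimic GROUP ... BY pubdate: map each date to the bag of rows sharing it."""
    groups = defaultdict(list)
    for row in rows:
        pubdate = row[1]
        groups[pubdate].append(row)  # the whole row goes into the bag
    return dict(groups)


date_group = group_by_pubdate(raw)
for pubdate, bag in sorted(date_group.items()):
    print(pubdate, bag)
```

The point of the comparison is that Pig distributes this grouping across a cluster, while the researcher only writes the one GROUP line.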
Write once…
• The Pig script couldn’t care less whether:
– the dataset is 12 MB or 12 TB
– it is running on a small VM or a huge cluster
– the dataset is a sample dataset only
Some tips
• Distributed computing’s Hello World is a word-count
(a.txt is a big list of words, one per line)
a = LOAD 'a.txt';
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a) AS num_rows;
DUMP c;
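As a point of comparison, the count script above does roughly what this plain-Python sketch does on a single machine: GROUP ... ALL puts every record into one group, and COUNT reports how many records that group holds. The function name count_rows and the sample words are illustrative assumptions, and the sketch assumes the input has no blank lines.

```python
def count_rows(lines):
    """Mimic GROUP a ALL followed by COUNT(a): one group, one total."""
    everything = list(lines)  # the single 'all' group
    return len(everything)


# Hypothetical stand-in for a.txt: one word per line
words = ["pig", "hadoop", "pig", "latin"]
num_rows = count_rows(words)
print(num_rows)
```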
Some tips
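• “subset = SAMPLE raw 0.01;”
– Keyword that takes a random sample (0.01, or roughly 1%) of
some source data (‘raw’), rather than processing the lot. Great
for testing. (The result is named ‘subset’ here because Pig
keywords are case-insensitive, so ‘sample’ cannot be used as a name.)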
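A small Python sketch of the idea behind SAMPLE, assuming it keeps each record independently with the given probability, which is why the result is about 1% of the input rather than exactly 1%. The function name sample_rows, the seed parameter and the stand-in data are assumptions made for illustration, not part of Pig.

```python
import random


def sample_rows(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # a seed makes test runs repeatable
    return [row for row in rows if rng.random() < fraction]


raw = list(range(100_000))  # hypothetical stand-in for the source data
subset = sample_rows(raw, 0.01, seed=42)
print(len(subset))  # close to 1,000, but it varies with the seed
```

Developing against a sample like this and then switching back to the full dataset is the "great for testing" workflow the tip describes.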
BNB and C19thC scripts
• See https://github.com/bl-labs
Thank you
