Apache Pig as a researcher’s stepping stone

Published in: Education

  1. Apache Pig as a researcher’s stepping stone. Ben O’Steen, @benosteen, ben.osteen@bl.uk
  2. Motivation (www.bl.uk):
       • (Anecdotally) Researchers are motivated by their subject.
           – Tools and techniques interest them if they can help further their knowledge and mastery of their chosen field.
  5. My Problem:
       • We have a lot of data.
           – More than will fit on researchers’ workstations, but not what HPC people consider Big Data™.
       • A different problem from typical HPC:
           – Ours: small compute over a series of large, messy datasets
           – HPC: large compute over “small”, well-characterised input datasets
       • What is the minimum a researcher needs to learn in order to make use of compute clouds?
  6. What choices are there?
       • Excel, while ubiquitous, has limitations, especially when dealing with semi-structured data.
       • OpenRefine is a fine choice, but has its own pros and cons.
       • General-purpose computing environment
           – I’m biased, but this is a great choice, though not an easy sell to task-focussed people.
       • Tailored computing environment
           – R, SciPy, MATLAB, and so on.
  7. What about Hadoop?
       • Industry backing and use.
       • Open and subscription-free.
       • Write once, run on any cluster
           – Well, mostly.
       • Clusters can be ‘spun up’ on demand from a number of providers (e.g. AWS, Azure).
       • Lovely. But…
  8. Researchers and distributed computing
       • The idea of trying to teach Map-Reduce or related techniques to a task-focussed researcher doesn’t appeal.
  9. Researchers and distributed computing
       • The idea of trying to teach how to do Map-Reduce in Java to a task-focussed researcher doesn’t appeal at all.
  10. Hiding Hadoop
       • A large number of projects are built on top of Hadoop
           – They use the Hadoop framework, but present a different way to utilise it
       • HBase, Mahout, Hive and, of course, Pig
  11. Why Pig?
       • From the wiki: “Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions but you can also create your own user-defined functions to do special-purpose processing.”
  13. Pig’s Philosophy
       • Pigs eat anything
       • Pigs live anywhere
       • Pigs are domestic animals
       • Pigs fly
       (from Programming Pig, by Alan Gates)
  14. What does Pig Latin look like?
        -- Load (id, pubdate) records, group them by publication date, store the result
        raw = LOAD 'c19/metadatalist' AS (id, pubdate);
        dates = FOREACH raw GENERATE id AS id, pubdate AS pubdate;
        date_group = GROUP dates BY pubdate;
        STORE date_group INTO 'c19/date_group';
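For readers who have not met GROUP before, the following is a minimal Python sketch of what the script on this slide computes, run on a tiny in-memory list rather than on Hadoop. The record values are invented for illustration and the function name is hypothetical; this is not how Pig itself is implemented.

```python
from collections import defaultdict

# Hypothetical stand-in for the contents of 'c19/metadatalist':
# each record is an (id, pubdate) pair, as in the LOAD ... AS clause.
records = [
    ("book-001", "1850"),
    ("book-002", "1851"),
    ("book-003", "1850"),
]


def group_by_pubdate(rows):
    """Mimic GROUP ... BY pubdate: map each pubdate to the records sharing it."""
    groups = defaultdict(list)
    for record_id, pubdate in rows:
        groups[pubdate].append((record_id, pubdate))
    return dict(groups)


date_group = group_by_pubdate(records)

# Each key plays the role of Pig's 'group' field, each list the bag of records.
for pubdate, bag in sorted(date_group.items()):
    print(pubdate, bag)
```

The point of the comparison is that the Pig version states only the transformation, and leaves the question of where and how it runs to the cluster.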
  15. Write once…
       • The Pig script couldn’t care less whether:
           – the dataset is 12 MB or 12 TB
           – it is running on a small VM or a huge cluster
           – the dataset is only a sample
  16. Some tips
       • Distributed computing’s Hello World is a word count. Here a.txt is a big list of words, one per line, and the script counts them all:
        a = LOAD 'a.txt';
        b = GROUP a ALL;
        c = FOREACH b GENERATE COUNT(a) AS num_rows;
        DUMP c;  -- print the total to the console
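The script on this slide produces a single number, the total count of rows in a.txt. As a rough illustration, the Python sketch below computes that total and also the per-word tally that "word count" usually refers to (in Pig that variant would group by the word field instead of using GROUP ... ALL). The word list and function names are made up for the example.

```python
from collections import Counter


def count_rows(words):
    """Total number of rows, like COUNT over GROUP ... ALL."""
    return sum(1 for _ in words)


def count_per_word(words):
    """Occurrences of each distinct word, like COUNT over a group-by-word."""
    return Counter(words)


# Hypothetical contents of a.txt, one word per line.
words = ["pig", "hadoop", "pig", "hive", "pig"]

print(count_rows(words))                     # 5
print(count_per_word(words).most_common(1))  # [('pig', 3)]
```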
  17. Some tips
       • “raw_sample = SAMPLE raw 0.01;”
           – SAMPLE is a keyword that takes a random sample (0.01, i.e. about 1%) of some source data (‘raw’) rather than processing the lot. Great for testing.
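As a sketch of the idea behind SAMPLE, the Python below keeps each row independently with the given probability, so the result is roughly, not exactly, 1% of the input. This is an illustration under that assumption rather than Pig's actual implementation; the function name and the seed parameter are hypothetical additions that make test runs repeatable.

```python
import random


def sample_rows(rows, fraction, seed=None):
    """Keep each row with probability `fraction` (0.01 means about 1%)."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


# Hypothetical source data standing in for 'raw'.
raw = list(range(100_000))
subset = sample_rows(raw, 0.01, seed=42)

# The size is close to 1,000 but varies with the seed.
print(len(subset))
```

Developing a script against a subset like this and then pointing the unchanged script at the full dataset is the workflow the slides describe.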
  18. BNB and C19thC scripts
       • See https://github.com/bl-labs
  19. Thank you
