Jonathan Coveney: Why Pig?

mortardata

Jonathan Coveney's talk to the NYC Pig User Group

Why Pig?
Jonathan Coveney
@jco
Pig PMC
Formerly: comScore, Spotify, Twitter

Pig at Twitter
● Hundreds of users
● Thousands of scripts
● Tens of thousands of daily jobs
● Many hundreds of internal UDFs
But it has been deprecated internally. Why?

Learning curve
[Figure: Raw MR, Scalding, and Pig compared; axis labels "complexity of code base" and "difficulty to implement". Raw MR is "somewhere up there..."]
Scalding: Scala, FP, compiling a job, consistent syntax (UDFs, etc.), testability, ease of using existing JVM infrastructure. "This is where Twitter is."
Pig: scripting language, simple enough syntax, 1-to-1 UDFs, many-to-1 UDFs, testing, scheduling, deploying, debugging weird errors, inconsistent syntax. "But a ton of really useful work happens here!"

What does Pig do well?
● It lets you get started quickly
o Great for exploring data sets
● It lets you describe your flow easily
o Much more maintainable than SQL for ETL
● Smart people are working on making it run well
on Spark and Tez
o Community!
For many (most?) companies, these benefits far
outweigh the negatives
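
To make the "describe your flow easily" point concrete, here is a minimal sketch in Python (not Pig Latin) of the step-by-step style the slide praises: each intermediate result gets a name, much as each relation does in a Pig script. The sample rows and field layout are hypothetical, and the comments only draw a loose analogy to Pig's FILTER, FOREACH ... GENERATE, and GROUP operators.

```python
from collections import Counter

# Hypothetical sample data: (user, country, bytes_sent) rows.
rows = [
    ("alice", "US", 120),
    ("bob", "SE", 0),
    ("carol", "US", 300),
    ("dave", "SE", 45),
]

# Step 1: keep only rows that carried traffic (loosely, a FILTER step).
active = [r for r in rows if r[2] > 0]

# Step 2: project the grouping key (loosely, a FOREACH ... GENERATE step).
countries = [country for _user, country, _bytes in active]

# Step 3: group and count (loosely, GROUP followed by COUNT).
per_country = Counter(countries)

print(sorted(per_country.items()))
```

Because every step is named, a reader can inspect `active` or `countries` on their own, which is harder to do with a single nested SQL query.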

What doesn’t Pig do well?
● Software engineering “in the large”
o Testing
o IDE support
● Consistency
o Grammar is inconsistent
o “Type system” inconsistent as well
● Aging code base
o Take a look at POForEach… If you can explain how
Accumulators are implemented, I’ll buy you a bottle
of scotch. You’ll need it.

Evolution of data at a company
● A couple scripts to make some reports
o “The data guy”
● A team whose job is to write and maintain pipelines that
others use
o “The data team”
● A team might guide analytics infrastructure decisions,
but many teams have analysts and engineers writing
and maintaining pipelines
o “The data singularity”
o This is where Pig is not as strong

So...what?
● My perspective
o Big data teams
o Analytics, data scientists, and data engineers spread
across the organization
o Most companies aren’t like this
▪ But they’re the ones driving a lot of investment in
these tools
● How can we ensure Pig is still useful in 5+
years?

A smattering of issues
● Adding types is horrendous
o Types were tacked on after the fact
● Pig could be a LOT faster
o Use code generation based on type information for
more memory and CPU efficient pipelines
● Lots of UDF boilerplate
o We shouldn’t need to check in a file for every single
function in Java’s standard library
● Pig internally is very stateful and difficult to reason
about
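
The UDF boilerplate complaint can be illustrated with a small sketch. It is written in Python and uses Python's standard library as a stand-in for Java's, so it is an analogy and not Pig's API: one generic invoker resolves a library function by its dotted name, and no per-function wrapper file is needed. The helper names `resolve` and `invoke` are hypothetical.

```python
import importlib
from typing import Any, Callable


def resolve(dotted_name: str) -> Callable[..., Any]:
    """Look up a callable such as 'math.sqrt' from its dotted name."""
    module_name, _, attr = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {dotted_name!r}")
    func = getattr(importlib.import_module(module_name), attr)
    if not callable(func):
        raise TypeError(f"{dotted_name} is not callable")
    return func


def invoke(dotted_name: str, *args: Any) -> Any:
    """Call a standard-library function without writing a wrapper for it."""
    return resolve(dotted_name)(*args)


print(invoke("math.sqrt", 16.0))
print(invoke("string.capwords", "why pig"))
```

The trade-off in a sketch like this is that the lookup happens at run time, so a misspelled name fails late; a typed DSL could in principle check such names before the job starts.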

A smattering of issues (cont)
● Testing is clunky
● Composition in Pig is poor
o Macros are clunky
● Development is tricky
o A constrained DSL with a type system could have a
really, really powerful IDE
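
As a point of contrast for the testing and composition bullets, the following Python sketch shows what composable, unit-testable pipeline stages can look like when each stage is an ordinary function. It does not show how Pig or Scalding work; the stage names, the record fields, and the `compose` helper are all hypothetical.

```python
from functools import reduce
from typing import Callable, Iterable

# A stage takes an iterable of records and returns an iterable of records.
Stage = Callable[[Iterable[dict]], Iterable[dict]]


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right into one reusable stage."""
    def run(records: Iterable[dict]) -> Iterable[dict]:
        return reduce(lambda acc, stage: stage(acc), stages, records)
    return run


def drop_bots(records: Iterable[dict]) -> Iterable[dict]:
    # Keep only records not flagged as bot traffic.
    return (r for r in records if not r["is_bot"])


def add_kb(records: Iterable[dict]) -> Iterable[dict]:
    # Add a derived field without mutating the input record.
    return ({**r, "kb": r["bytes"] / 1024} for r in records)


# Composition: two small stages become one named, reusable stage.
clean_traffic = compose(drop_bots, add_kb)

# Testing: the composed stage runs on in-memory data, no cluster needed.
sample = [
    {"user": "alice", "bytes": 2048, "is_bot": False},
    {"user": "crawler", "bytes": 4096, "is_bot": True},
]
result = list(clean_traffic(sample))
print(result)
```

Since stages are plain functions, an editor can offer completion and type hints for them, which is the kind of tooling the IDE bullet is asking for.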

Fixing Pig
● Rewrite it all in Haskell!!
o Or Scala… or even Java!
● But really: needs cleanup
● Current development model is unsustainable
o Tez support has taken a group of Pig experts quite a
while to get at all close to working
o Sophisticated new features should be able to be
written by smart non-experts in a modular fashion

TLDR
● Pig is a very useful tool!
o Tons of mature functionality
o Lots of successful deployments
o I dare anyone in the Hadoop ecosystem circa 2006
to do a better job
● But it’s long in the tooth
o Mainly when it comes to “big company” issues
o An incumbent ripe for disruption
