SlideShare a Scribd company logo
1 of 11
Why Pig?
Jonathan Coveney
@jco
Pig PMC
Formerly: comScore, Spotify, Twitter
Pig at Twitter
● Hundreds of users
● Thousands of Scripts
● Tens of thousands of daily jobs
● Many hundreds of internal UDFS
But it has been deprecated internally. Why?
Learning curve
complexity of code base
difficultytoimplement
Raw MR somewhere up there...
Scalding
Pig
Scala
FP
compiling a Job
Consistent syntax (UDFs, etc)
testability
Ease of using existing JVM infrastructure
This is where
Twitter is
But a ton of really useful
work happens here!
Scripting language
Simple enough syntax
1 to 1 UDFs
Many to 1 UDFs
Testing
Scheduling
Deploying
Debugging weird errors
Inconsistent syntax
What does Pig do well?
● It lets you get started quickly
o Great for exploring data sets
● It let’s you describe your flow easily
o Much more maintainable than SQL for ETL
● Have smart people working on making it work well
on Spark, Tez
o Community!
For many (most?) companies, these benefits far
outweigh the negatives
What doesn’t Pig do well?
● Software engineering “in the large”
o Testing
o IDE support
● Consistency
o Grammar is inconsistent
o “Type system” inconsistent as well
● Aging code base
o Take a look at POForEach… If you can explain how
Accumulators are implemented, I’ll buy you a bottle
of scotch. You’ll need it
Evolution of data at a company
● A couple scripts to make some reports
o “The data guy”
● A team whose job is to write and maintain pipelines that
others use
o “The data team”
● A team might guide analytics infrastructure decisions,
but many teams have analysts and engineers writing
and maintaining pipelines
o “The data singularity”
o This is where Pig is not as strong
● My perspective
o Big data teams
o Analytics, data scientists, and data engineers spread
across the organization
o Most companies aren’t like this
 But they’re the ones driving a lot of investment in
these tools
● How can we ensure Pig is still useful in 5+
years?
So...what?
● Adding types is horrendous
o Types were tacked on after the fact
● Pig could be a LOT faster
o Use code generation based on type information for
more memory and CPU efficient pipelines
● Lots of UDF boilerplate
o We shouldn’t need to check in a file for every single
function in Java’s standard library
● Pig internally is very stateful and difficult to reason
about
A smattering of issues
A smattering of issues (cont)
● Testing is clunky
● Composition in Pig is poor
o Macros are clunky
● Development is tricky
o A constrained DSL with a type system could have a
really, really powerful IDE
Fixing Pig
● Rewrite it all in Haskell!!
o Or Scala… or even Java!
● But really: needs cleanup
● Current development model is unsustainable
o Tez support has taken a group of Pig experts quite a
while to get at all close to working
o Sophisticated new features should be able to be
written by smart non-experts in a modular fashion
TLDR
● Pig is a very useful tool!
o Tons of mature functionality
o Lots of successful deployments
o I dare anyone in the Hadoop ecosystem circa 2006
to do a better job
● But it’s long in the tooth
o Mainly when it comes to “big company” issues
o An incumbent ripe for disruption

More Related Content

What's hot

Python VS GO
Python VS GOPython VS GO
Python VS GOOfir Nir
 
DIANA: Recent developments in GooFit
DIANA: Recent developments in GooFitDIANA: Recent developments in GooFit
DIANA: Recent developments in GooFitHenry Schreiner
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangSri Ambati
 
2019 IRIS-HEP AS workshop: Particles and decays
2019 IRIS-HEP AS workshop: Particles and decays2019 IRIS-HEP AS workshop: Particles and decays
2019 IRIS-HEP AS workshop: Particles and decaysHenry Schreiner
 
H2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi MehtaH2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi MehtaSri Ambati
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Holden Karau
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016MLconf
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiGrowth Intelligence
 
PyCon Ukraine 2016: Maintaining a high load Python project for newcomers
PyCon Ukraine 2016: Maintaining a high load Python project for newcomersPyCon Ukraine 2016: Maintaining a high load Python project for newcomers
PyCon Ukraine 2016: Maintaining a high load Python project for newcomersViach Kakovskyi
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...University of California, San Diego
 
GPU Computing for Data Science
GPU Computing for Data Science GPU Computing for Data Science
GPU Computing for Data Science Domino Data Lab
 
Making fitting in RooFit faster
Making fitting in RooFit fasterMaking fitting in RooFit faster
Making fitting in RooFit fasterPatrick Bos
 
C# - Raise the bar with functional & immutable constructs (Dutch)
C# - Raise the bar with functional & immutable constructs (Dutch)C# - Raise the bar with functional & immutable constructs (Dutch)
C# - Raise the bar with functional & immutable constructs (Dutch)Rick Beerendonk
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Modern Data Stack France
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012Dan Lynn
 
ACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and LanguageACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and LanguageMarko Rodriguez
 
Prototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.orgPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.orgPeter Skomoroch
 

What's hot (20)

NANO266 - Lecture 9 - Tools of the Modeling Trade
NANO266 - Lecture 9 - Tools of the Modeling TradeNANO266 - Lecture 9 - Tools of the Modeling Trade
NANO266 - Lecture 9 - Tools of the Modeling Trade
 
Python VS GO
Python VS GOPython VS GO
Python VS GO
 
DIANA: Recent developments in GooFit
DIANA: Recent developments in GooFitDIANA: Recent developments in GooFit
DIANA: Recent developments in GooFit
 
H2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy WangH2O World - Intro to R, Python, and Flow - Amy Wang
H2O World - Intro to R, Python, and Flow - Amy Wang
 
2019 IRIS-HEP AS workshop: Particles and decays
2019 IRIS-HEP AS workshop: Particles and decays2019 IRIS-HEP AS workshop: Particles and decays
2019 IRIS-HEP AS workshop: Particles and decays
 
H2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi MehtaH2O World - PySparkling Water - Nidhi Mehta
H2O World - PySparkling Water - Nidhi Mehta
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
 
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016
 
A Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with LuigiA Beginner's Guide to Building Data Pipelines with Luigi
A Beginner's Guide to Building Data Pipelines with Luigi
 
PyCon Ukraine 2016: Maintaining a high load Python project for newcomers
PyCon Ukraine 2016: Maintaining a high load Python project for newcomersPyCon Ukraine 2016: Maintaining a high load Python project for newcomers
PyCon Ukraine 2016: Maintaining a high load Python project for newcomers
 
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...The Materials Project Ecosystem - A Complete Software and Data Platform for M...
The Materials Project Ecosystem - A Complete Software and Data Platform for M...
 
GPU Computing for Data Science
GPU Computing for Data Science GPU Computing for Data Science
GPU Computing for Data Science
 
Making fitting in RooFit faster
Making fitting in RooFit fasterMaking fitting in RooFit faster
Making fitting in RooFit faster
 
C# - Raise the bar with functional & immutable constructs (Dutch)
C# - Raise the bar with functional & immutable constructs (Dutch)C# - Raise the bar with functional & immutable constructs (Dutch)
C# - Raise the bar with functional & immutable constructs (Dutch)
 
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
Hadoop meetup : HUGFR Construire le cluster le plus rapide pour l'analyse des...
 
Storm: The Real-Time Layer - GlueCon 2012
Storm: The Real-Time Layer  - GlueCon 2012Storm: The Real-Time Layer  - GlueCon 2012
Storm: The Real-Time Layer - GlueCon 2012
 
ACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and LanguageACM DBPL Keynote: The Graph Traversal Machine and Language
ACM DBPL Keynote: The Graph Traversal Machine and Language
 
Elasticwulf Pycon Talk
Elasticwulf Pycon TalkElasticwulf Pycon Talk
Elasticwulf Pycon Talk
 
Prototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.orgPrototyping Data Intensive Apps: TrendingTopics.org
Prototyping Data Intensive Apps: TrendingTopics.org
 
SFrame
SFrameSFrame
SFrame
 

Viewers also liked

Daeil Kim: Machine Learning at the New York Times
Daeil Kim: Machine Learning at the New York TimesDaeil Kim: Machine Learning at the New York Times
Daeil Kim: Machine Learning at the New York Timesmortardata
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to HivemallMakoto Yui
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigDataWorks Summit/Hadoop Summit
 
TEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of WorkTEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of WorkVolker Hirsch
 

Viewers also liked (9)

Daeil Kim: Machine Learning at the New York Times
Daeil Kim: Machine Learning at the New York TimesDaeil Kim: Machine Learning at the New York Times
Daeil Kim: Machine Learning at the New York Times
 
Pig on Spark
Pig on SparkPig on Spark
Pig on Spark
 
The Knowledge Management Advantage
The Knowledge Management AdvantageThe Knowledge Management Advantage
The Knowledge Management Advantage
 
HBase Data Types
HBase Data TypesHBase Data Types
HBase Data Types
 
Introduction to Hivemall
Introduction to HivemallIntroduction to Hivemall
Introduction to Hivemall
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
 
TEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of WorkTEDx Manchester: AI & The Future of Work
TEDx Manchester: AI & The Future of Work
 
Build Features, Not Apps
Build Features, Not AppsBuild Features, Not Apps
Build Features, Not Apps
 

Similar to Jonathan Coveney: Why Pig?

Prototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporePrototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporeDhruv Gohil
 
How to code in the XXI century without losing your head
How to code in the XXI century without losing your headHow to code in the XXI century without losing your head
How to code in the XXI century without losing your headRené Olivo
 
Running Neo4j in Production: Tips, Tricks and Optimizations
Running Neo4j in Production:  Tips, Tricks and OptimizationsRunning Neo4j in Production:  Tips, Tricks and Optimizations
Running Neo4j in Production: Tips, Tricks and OptimizationsNick Manning
 
Running Neo4j in Production: Tips, Tricks and Optimizations
Running Neo4j in Production:  Tips, Tricks and OptimizationsRunning Neo4j in Production:  Tips, Tricks and Optimizations
Running Neo4j in Production: Tips, Tricks and OptimizationsNick Manning
 
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsPyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsUwe Korn
 
Poing: a coder’s take on protein modelling
Poing: a coder’s take on protein modellingPoing: a coder’s take on protein modelling
Poing: a coder’s take on protein modellingBiogeeks
 
Do Languages Matter?
Do Languages Matter?Do Languages Matter?
Do Languages Matter?Bruce Eckel
 
200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet CodeDavid Danzilio
 
Scala - the good, the bad and the very ugly
Scala - the good, the bad and the very uglyScala - the good, the bad and the very ugly
Scala - the good, the bad and the very uglyBozhidar Bozhanov
 
PHP to Python with No Regrets
PHP to Python with No RegretsPHP to Python with No Regrets
PHP to Python with No RegretsAlex Ezell
 
Ceph Day SF 2015 - Keynote
Ceph Day SF 2015 - Keynote Ceph Day SF 2015 - Keynote
Ceph Day SF 2015 - Keynote Ceph Community
 
What drives Innovation? Innovations And Technological Solutions for the Distr...
What drives Innovation? Innovations And Technological Solutions for the Distr...What drives Innovation? Innovations And Technological Solutions for the Distr...
What drives Innovation? Innovations And Technological Solutions for the Distr...Stefano Fago
 
How to Choose a Deep Learning Framework
How to Choose a Deep Learning FrameworkHow to Choose a Deep Learning Framework
How to Choose a Deep Learning FrameworkNavid Kalaei
 
The *on-going* future of Perl5
The *on-going* future of Perl5The *on-going* future of Perl5
The *on-going* future of Perl5Vytautas Dauksa
 
Simplifying training deep and serving learning models with big data in python...
Simplifying training deep and serving learning models with big data in python...Simplifying training deep and serving learning models with big data in python...
Simplifying training deep and serving learning models with big data in python...Holden Karau
 

Similar to Jonathan Coveney: Why Pig? (20)

Prototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at SingaporePrototype4Production Presented at FOSSASIA2015 at Singapore
Prototype4Production Presented at FOSSASIA2015 at Singapore
 
How to code in the XXI century without losing your head
How to code in the XXI century without losing your headHow to code in the XXI century without losing your head
How to code in the XXI century without losing your head
 
Running Neo4j in Production: Tips, Tricks and Optimizations
Running Neo4j in Production:  Tips, Tricks and OptimizationsRunning Neo4j in Production:  Tips, Tricks and Optimizations
Running Neo4j in Production: Tips, Tricks and Optimizations
 
Running Neo4j in Production: Tips, Tricks and Optimizations
Running Neo4j in Production:  Tips, Tricks and OptimizationsRunning Neo4j in Production:  Tips, Tricks and Optimizations
Running Neo4j in Production: Tips, Tricks and Optimizations
 
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsPyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" Ecosystems
 
Poing: a coder’s take on protein modelling
Poing: a coder’s take on protein modellingPoing: a coder’s take on protein modelling
Poing: a coder’s take on protein modelling
 
Do Languages Matter?
Do Languages Matter?Do Languages Matter?
Do Languages Matter?
 
200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code200,000 Lines Later: Our Journey to Manageable Puppet Code
200,000 Lines Later: Our Journey to Manageable Puppet Code
 
Scala - the good, the bad and the very ugly
Scala - the good, the bad and the very uglyScala - the good, the bad and the very ugly
Scala - the good, the bad and the very ugly
 
groovy & grails - lecture 1
groovy & grails - lecture 1groovy & grails - lecture 1
groovy & grails - lecture 1
 
PHP to Python with No Regrets
PHP to Python with No RegretsPHP to Python with No Regrets
PHP to Python with No Regrets
 
Apache pig
Apache pigApache pig
Apache pig
 
Ceph Day SF 2015 - Keynote
Ceph Day SF 2015 - Keynote Ceph Day SF 2015 - Keynote
Ceph Day SF 2015 - Keynote
 
Learning to code in 2020
Learning to code in 2020Learning to code in 2020
Learning to code in 2020
 
What drives Innovation? Innovations And Technological Solutions for the Distr...
What drives Innovation? Innovations And Technological Solutions for the Distr...What drives Innovation? Innovations And Technological Solutions for the Distr...
What drives Innovation? Innovations And Technological Solutions for the Distr...
 
Spaghetti gate
Spaghetti gateSpaghetti gate
Spaghetti gate
 
How to Choose a Deep Learning Framework
How to Choose a Deep Learning FrameworkHow to Choose a Deep Learning Framework
How to Choose a Deep Learning Framework
 
The *on-going* future of Perl5
The *on-going* future of Perl5The *on-going* future of Perl5
The *on-going* future of Perl5
 
Java basics
Java basicsJava basics
Java basics
 
Simplifying training deep and serving learning models with big data in python...
Simplifying training deep and serving learning models with big data in python...Simplifying training deep and serving learning models with big data in python...
Simplifying training deep and serving learning models with big data in python...
 

More from mortardata

Can Big Data Save the World? By Jake Porway
Can Big Data Save the World? By Jake PorwayCan Big Data Save the World? By Jake Porway
Can Big Data Save the World? By Jake Porwaymortardata
 
Max Shron, Thinking with Data at the NYC Data Science Meetup
Max Shron, Thinking with Data at the NYC Data Science MeetupMax Shron, Thinking with Data at the NYC Data Science Meetup
Max Shron, Thinking with Data at the NYC Data Science Meetupmortardata
 
Drew Conway: A Social Scientist's Perspective on Data Science
Drew Conway: A Social Scientist's Perspective on Data ScienceDrew Conway: A Social Scientist's Perspective on Data Science
Drew Conway: A Social Scientist's Perspective on Data Sciencemortardata
 
Data Science at Tumblr
Data Science at TumblrData Science at Tumblr
Data Science at Tumblrmortardata
 
Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)mortardata
 
Mortar: Hadoop-as-a-Service + Open Source Framework | AWS re: Invent public …
Mortar: Hadoop-as-a-Service + Open Source Framework | AWS re: Invent public …Mortar: Hadoop-as-a-Service + Open Source Framework | AWS re: Invent public …
Mortar: Hadoop-as-a-Service + Open Source Framework | AWS re: Invent public …mortardata
 

More from mortardata (6)

Can Big Data Save the World? By Jake Porway
Can Big Data Save the World? By Jake PorwayCan Big Data Save the World? By Jake Porway
Can Big Data Save the World? By Jake Porway
 
Max Shron, Thinking with Data at the NYC Data Science Meetup
Max Shron, Thinking with Data at the NYC Data Science MeetupMax Shron, Thinking with Data at the NYC Data Science Meetup
Max Shron, Thinking with Data at the NYC Data Science Meetup
 
Drew Conway: A Social Scientist's Perspective on Data Science
Drew Conway: A Social Scientist's Perspective on Data ScienceDrew Conway: A Social Scientist's Perspective on Data Science
Drew Conway: A Social Scientist's Perspective on Data Science
 
Data Science at Tumblr
Data Science at TumblrData Science at Tumblr
Data Science at Tumblr
 
Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)Hadoop, Pig, and Python (PyData NYC 2012)
Hadoop, Pig, and Python (PyData NYC 2012)
 
Mortar: Hadoop-as-a-Service + Open Source Framework | AWS re: Invent public …
Mortar: Hadoop-as-a-Service + Open Source Framework | AWS re: Invent public …Mortar: Hadoop-as-a-Service + Open Source Framework | AWS re: Invent public …
Mortar: Hadoop-as-a-Service + Open Source Framework | AWS re: Invent public …
 

Recently uploaded

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Jonathan Coveney: Why Pig?

  • 1. Why Pig? Jonathan Coveney @jco Pig PMC Formerly: comScore, Spotify, Twitter
  • 2. Pig at Twitter ● Hundreds of users ● Thousands of Scripts ● Tens of thousands of daily jobs ● Many hundreds of internal UDFS But it has been deprecated internally. Why?
  • 3. Learning curve complexity of code base difficultytoimplement Raw MR somewhere up there... Scalding Pig Scala FP compiling a Job Consistent syntax (UDFs, etc) testability Ease of using existing JVM infrastructure This is where Twitter is But a ton of really useful work happens here! Scripting language Simple enough syntax 1 to 1 UDFs Many to 1 UDFs Testing Scheduling Deploying Debugging weird errors Inconsistent syntax
  • 4. What does Pig do well? ● It lets you get started quickly o Great for exploring data sets ● It let’s you describe your flow easily o Much more maintainable than SQL for ETL ● Have smart people working on making it work well on Spark, Tez o Community! For many (most?) companies, these benefits far outweigh the negatives
  • 5. What doesn’t Pig do well? ● Software engineering “in the large” o Testing o IDE support ● Consistency o Grammar is inconsistent o “Type system” inconsistent as well ● Aging code base o Take a look at POForEach… If you can explain how Accumulators are implemented, I’ll buy you a bottle of scotch. You’ll need it
  • 6. Evolution of data at a company ● A couple scripts to make some reports o “The data guy” ● A team whose job is to write and maintain pipelines that others use o “The data team” ● A team might guide analytics infrastructure decisions, but many teams have analysts and engineers writing and maintaining pipelines o “The data singularity” o This is where Pig is not as strong
  • 7. ● My perspective o Big data teams o Analytics, data scientists, and data engineers spread across the organization o Most companies aren’t like this  But they’re the ones driving a lot of investment in these tools ● How can we ensure Pig is still useful in 5+ years? So...what?
  • 8. ● Adding types is horrendous o Types were tacked on after the fact ● Pig could be a LOT faster o Use code generation based on type information for more memory and CPU efficient pipelines ● Lots of UDF boilerplate o We shouldn’t need to check in a file for every single function in Java’s standard library ● Pig internally is very stateful and difficult to reason about A smattering of issues
  • 9. A smattering of issues (cont) ● Testing is clunky ● Composition in Pig is poor o Macros are clunky ● Development is tricky o A constrained DSL with a type system could have a really, really powerful IDE
  • 10. Fixing Pig ● Rewrite it all in Haskell!! o Or Scala… or even Java! ● But really: needs cleanup ● Current development model is unsustainable o Tez support has taken a group of Pig experts quite a while to get at all close to working o Sophisticated new features should be able to be written by smart non-experts in a modular fashion
  • 11. TLDR ● Pig is a very useful tool! o Tons of mature functionality o Lots of successful deployments o I dare anyone in the Hadoop ecosystem circa 2006 to do a better job ● But it’s long in the tooth o Mainly when it comes to “big company” issues o An incumbent ripe for disruption