2. Pig at Twitter
● Hundreds of users
● Thousands of scripts
● Tens of thousands of daily jobs
● Many hundreds of internal UDFs
But it has been deprecated internally. Why?
3. Learning curve
[Figure: Raw MR, Scalding, and Pig placed by complexity of code base and difficulty to implement, with Raw MR "somewhere up there...". Annotations: "This is where Twitter is" and "But a ton of really useful work happens here!". Labels on the diagram: Scala; FP; compiling a job; consistent syntax (UDFs, etc.); testability; ease of using existing JVM infrastructure; scripting language; simple enough syntax; 1-to-1 UDFs; many-to-1 UDFs; testing; scheduling; deploying; debugging weird errors; inconsistent syntax.]
4. What does Pig do well?
● It lets you get started quickly
o Great for exploring data sets
● It lets you describe your flow easily
o Much more maintainable than SQL for ETL
● Smart people are working on making it run well
on Spark and Tez
o Community!
For many (most?) companies, these benefits far
outweigh the negatives
5. What doesn’t Pig do well?
● Software engineering “in the large”
o Testing
o IDE support
● Consistency
o Grammar is inconsistent
o “Type system” is inconsistent as well
● Aging code base
o Take a look at POForEach… If you can explain how
Accumulators are implemented, I’ll buy you a bottle
of scotch. You’ll need it
6. Evolution of data at a company
● A couple scripts to make some reports
o “The data guy”
● A team whose job is to write and maintain pipelines that
others use
o “The data team”
● A team might guide analytics infrastructure decisions,
but many teams have analysts and engineers writing
and maintaining pipelines
o “The data singularity”
o This is where Pig is not as strong
7. So...what?
● My perspective
o Big data teams
o Analytics, data scientists, and data engineers spread
across the organization
o Most companies aren’t like this
o But the ones that are drive a lot of the investment in
these tools
● How can we ensure Pig is still useful in 5+
years?
8. A smattering of issues
● Adding types is horrendous
o Types were tacked on after the fact
● Pig could be a LOT faster
o Use code generation based on type information for
more memory- and CPU-efficient pipelines
● Lots of UDF boilerplate
o We shouldn’t need to check in a file for every single
function in Java’s standard library
● Pig internally is very stateful and difficult to reason
about
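To make the UDF boilerplate point concrete, here is a minimal sketch of the alternative: resolving an existing standard-library function by name instead of checking in one wrapper file per function. It is written in Python as a stand-in for the Java case, and `invoker` and `apply_udf` are hypothetical helpers invented for this illustration, not part of Pig.

```python
import importlib
from typing import Any, Callable, Iterable, List, Tuple


def invoker(qualified_name: str) -> Callable[..., Any]:
    """Resolve a dotted 'module.function' name to a callable.

    Hypothetical helper: one generic resolver replaces a hand-written
    wrapper per library function.
    """
    module_name, _, func_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {qualified_name!r}")
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError(f"{qualified_name} is not callable")
    return func


def apply_udf(udf: Callable[..., Any], rows: Iterable[Tuple]) -> List[Any]:
    """Apply a 1-to-1 function to each row, where a row is a tuple of arguments."""
    return [udf(*row) for row in rows]


# Two standard-library functions used as "UDFs" with no wrapper code at all.
url_decode = invoker("urllib.parse.unquote")
sqrt = invoker("math.sqrt")

decoded = apply_udf(url_decode, [("a%20b",), ("x%2Fy",)])
roots = apply_udf(sqrt, [(4.0,), (9.0,)])
```

The trade-off in this sketch is that name resolution happens at run time, so a typo in the dotted name only fails when the script runs; a typed DSL could check it earlier.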
9. A smattering of issues (cont)
● Testing is clunky
● Composition in Pig is poor
o Macros are clunky
● Development is tricky
o A constrained DSL with a type system could have a
really, really powerful IDE
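As a rough illustration of what better composition and testing could look like, the sketch below treats each pipeline stage as a plain function, chains stages with an ordinary combinator, and checks the result against in-memory rows. This is Python rather than Pig Latin, and every name in it (`compose`, `only_clicks`, `count_by_user`) is hypothetical.

```python
from collections import Counter
from functools import reduce
from typing import Callable, Iterable, List, Tuple

Event = Tuple[str, str]  # (user, action)


def compose(*stages: Callable) -> Callable:
    """Chain stages left to right into a single pipeline function."""
    def pipeline(rows):
        return reduce(lambda acc, stage: stage(acc), stages, rows)
    return pipeline


def only_clicks(rows: Iterable[Event]) -> Iterable[Event]:
    """Filter stage: keep rows whose action is 'click'."""
    return (row for row in rows if row[1] == "click")


def count_by_user(rows: Iterable[Event]) -> List[Tuple[str, int]]:
    """Group-and-count stage: number of rows per user, sorted by user."""
    return sorted(Counter(user for user, _ in rows).items())


# Stages are ordinary values, so they can be reused and recombined freely.
clicks_per_user = compose(only_clicks, count_by_user)

events = [("ann", "click"), ("bob", "view"), ("ann", "click"), ("bob", "click")]
result = clicks_per_user(events)
```

Because each stage is a function over in-memory rows, a unit test is a single assertion on `result`, with no cluster or fixture files involved.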
10. Fixing Pig
● Rewrite it all in Haskell!!
o Or Scala… or even Java!
● But really: needs cleanup
● Current development model is unsustainable
o Tez support has taken a group of Pig experts a long
time to get anywhere close to working
o Smart non-experts should be able to write
sophisticated new features in a modular fashion
11. TLDR
● Pig is a very useful tool!
o Tons of mature functionality
o Lots of successful deployments
o I dare anyone in the Hadoop ecosystem circa 2006
to do a better job
● But it’s long in the tooth
o Mainly when it comes to “big company” issues
o An incumbent ripe for disruption