Hadoop, Pig, and Python
PyData NYC 2012
Overview
OF THIS SESSION




Why Python on Hadoop?
Fast Hadoop overview
Jython
Python
MrJob
Pig
(How they work, challenges, efficiency, how to start)
Too much data
FOR ONE MACHINE




Data doubles every 18 months
ETL / Munging




Cleanse
Format
Simple calculations
Social Graph
Predict
Detect
Genetics
Hadoop
RAPID OVERVIEW




MapReduce programming model from Google
(Jeff Dean and Sanjay Ghemawat)
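The model in miniature, as a rough single-process sketch: a map function emits key/value pairs, the framework groups values by key (the shuffle), and a reduce function collapses each group. The names `map_words`, `reduce_counts`, and `run_mapreduce` are made up for illustration and are not part of Hadoop or of Google's implementation.

```python
from collections import defaultdict


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: collapse all values seen for one key into a total."""
    return word, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Single-process stand-in for the framework: map, shuffle, reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)  # shuffle: group values by key
    # Reduce each key's group; keys are visited in sorted order.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    print(run_mapreduce(["pig python", "python hadoop python"],
                        map_words, reduce_counts))
```

On a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network; only the two small functions are the programmer's job.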
Hadoop
RAPID OVERVIEW




Hadoop implements MapReduce (Java)
(Doug Cutting)
Incubated at Yahoo
Indexing, Spam detection, more
Hadoop
PROBLEMS




Difficult
Not much Python
Batch only (...or it was)
Hadoop
FUTURE




YARN
MapReduce optional
Generic management + distributed apps
Impala
Hadoop
AND PYTHON
Jython
ON HADOOP (MAP)
Jython
ON HADOOP (REDUCE; 1ST HALF)
Jython
ON HADOOP (REDUCE; 2ND HALF)
Jython
ON HADOOP
Python
ON HADOOP




Streaming




(Works with any language, not just Python)
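What a Streaming job can look like in plain Python, as a minimal word-count sketch: the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated key/value lines on stdout. The file name `wordcount.py` and the `map` / `reduce` command-line convention are assumptions for this example, not something Streaming requires.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; relies on input already being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: run as `wordcount.py map` or `wordcount.py reduce`.
if __name__ == "__main__" and sys.argv[1:] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

The same script can be tried without a cluster: `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`. The `sort` stands in for the shuffle between map and reduce.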
MrJob (Python)
ON HADOOP




Streaming + local / EMR / your Hadoop
MrJob (Python)
ON HADOOP




Multi-step jobs
Pig
ON HADOOP




Less code
Expressive code
Pig
BRIEF, EXPRESSIVE




(thanks: twitter hadoop world presentation)
The Same Script, In
FOR SERIOUS
Pig
ON HADOOP




Less code
Expressive code
Compiles to MR
Insulates from API
Popular
(LinkedIn, Twitter, Salesforce, Yahoo, Stanford)
Pig
ON HADOOP




Works with Jython
Not Python
Stream, no types
UDF reads stdin
UDF deserializes, no types
Serialize for Pig
Write to stdout
Exceptions
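The streaming steps listed above, as a rough sketch of a CPython script on the receiving end of a Pig stream. It assumes records arrive as tab-delimited text lines with no type information; the three-field layout (name, clicks, views) and the `score` function are invented for illustration.

```python
import sys


def deserialize(line):
    """Split one tab-delimited record; every field arrives as a plain string."""
    return line.rstrip("\n").split("\t")


def serialize(fields):
    """Join fields back into the tab-delimited form assumed on the way out."""
    return "\t".join(str(field) for field in fields)


def score(fields):
    """Hypothetical UDF body: cast types by hand, then compute a ratio."""
    name, clicks, views = fields[0], int(fields[1]), int(fields[2])
    return [name, round(clicks / views, 4)]


def process(lines, err=sys.stderr):
    """Yield one output record per good input record; report and skip bad ones."""
    for line in lines:
        try:
            yield serialize(score(deserialize(line)))
        except (ValueError, IndexError, ZeroDivisionError) as exc:
            # An uncaught exception would end the process mid-stream,
            # so log the bad record and keep going.
            err.write(f"skipped {line!r}: {exc}\n")


# To wire it up as a stream script:
#     for record in process(sys.stdin):
#         print(record)
```

Everything the script gets is a string, so casting, error handling, and output formatting are all on the script author.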
Pig + Python
ON HADOOP
Hadoop + Python
NOT ACTUALLY MAGIC




Hadoop won’t magically parallelize your algorithm
Hadoop + Python
EFFICIENCY




Don’t stream Java-based languages
• Jython
• Pig + Jython


Streaming has ~30% overhead
• Python
• MrJob
• Pig + Python
Hadoop + Python
EXCITED?




Well... 90-95% of time isn’t spent on algos
Hadoop + Python
HARD STUFF: SETUP




Get Hadoop running
Software where it needs to be
Processes communicating
Data available
Hadoop + Python
HARD STUFF: DEVELOP




Learn
Project structure, modularity
Dev environment like Production
Hadoop + Python
HARD STUFF: VALIDATE




Syntax check
Packages available
Data readable
Data writable
Without long waits for failure
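A few of these checks can be run locally in seconds before a job is ever submitted. The sketch below is one possible pre-flight script using only the standard library; the function names are invented here, and the path checks only cover local files (data in HDFS or S3 would need its own check).

```python
import ast
import importlib.util
import os


def check_syntax(source, filename="<job>"):
    """Return None if the source parses, else a one-line error message."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


def missing_packages(names):
    """Return the top-level packages that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_paths(readable=(), writable=()):
    """Return local paths that fail a read or write permission check."""
    problems = [p for p in readable if not os.access(p, os.R_OK)]
    problems += [p for p in writable if not os.access(p, os.W_OK)]
    return problems
```

Passing these checks does not mean the job will succeed on the cluster; it only catches the cheap failures before the long wait.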
Hadoop + Python
HARD STUFF: DEBUG




Distributed execution is hard to debug
Hadoop + Python
HARD STUFF: TEST




Data processing is hard to test
But critical
Hadoop + Python
HARD STUFF: DEPLOY




Environments identical
Code correctly deployed
Configuration changes
Non-disruptive
Hadoop + Python
HARD STUFF: HISTORY




Stats about prior runs
What code was run?
What’s changed?
Hadoop + Python
HARD STUFF: LOGS




Distributed logs hard to make sense of
Hadoop logs hard to understand
Ephemeral clusters lose logs
Hadoop + Python
HARD STUFF: MORTAR’S APPROACH




Setup: PaaS, pip installation, connectors
Develop: learning, structure, instant dev env
Validate: fast validate
Debug: printf, more coming
Test: Rails-like test suites
Deploy: one-button deploy
K Young
@kky