Hadoop, Pig, and Python (PyData NYC 2012)

Mortar CEO K Young's talk on using Python with Hadoop and Pig (vs. Jython and MapReduce), including NumPy, SciPy, and NLTK.

Presentation Transcript

  • Hadoop, Pig, and Python: PyData NYC 2012
  • Overview: OF THIS SESSION
    Why Python on Hadoop?
    Fast Hadoop overview
    Jython
    Python
    MrJob
    Pig
    (How they work, challenges, efficiency, how to start)
  • Too much data: FOR ONE MACHINE
    Data doubles every 18 mo
  • ETL / Munging
    Cleanse
    Format
    Simple calculations
  • Social Graph
  • Predict
  • Detect
  • Genetics
  • Hadoop: RAPID OVERVIEW
    MapReduce programming model from Google
    (Jeff Dean and Sanjay Ghemawat)
  • Hadoop: RAPID OVERVIEW
  • Hadoop: RAPID OVERVIEW
    Hadoop implements MapReduce (Java)
    (Doug Cutting)
    Incubated at Yahoo
    Indexing, Spam detection, more
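The MapReduce programming model named on the slides above can be illustrated on a single machine. The following is a minimal sketch, not Hadoop's implementation: the word-count task and the names `map_words`, `reduce_counts`, and `run_mapreduce` are assumptions chosen for illustration, and the in-memory grouping stands in for the shuffle that a real cluster performs across machines.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: combine all values that share one key."""
    return word, sum(counts)


def run_mapreduce(lines, mapper, reducer):
    """Run mapper and reducer over lines in a single process."""
    # Shuffle: group every mapped value under its key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    # Reduce each key independently; on a cluster these calls run in parallel.
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["pig and python", "python on hadoop", "pig on hadoop"]
    print(run_mapreduce(sample, map_words, reduce_counts))
```

Because the mapper sees one line at a time and the reducer sees one key at a time, neither function needs to know how many machines are involved, which is what lets a framework distribute them.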
  • Hadoop: PROBLEMS
    Difficult
    Not much Python
    Batch only (...or it was)
  • Hadoop: FUTURE
    Yarn
    MapReduce optional
    Generic management + distributed apps
    Impala
  • Hadoop: AND PYTHON
  • Jython: ON HADOOP (MAP)
  • Jython: ON HADOOP (REDUCE; 1ST HALF)
  • Jython: ON HADOOP (REDUCE; 2ND HALF)
  • Jython: ON HADOOP
  • Python: ON HADOOP
    Streaming
    (Works with any language, not just
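Hadoop Streaming passes records to the mapper and reducer as lines on stdin and reads their results from stdout, which is why it works with any language. The following is a minimal word-count sketch under that model, not the code shown in the talk: it assumes tab-separated `key<TAB>value` records, and the single-file layout with a `map` or `reduce` argument is a hypothetical convention for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Turn raw text lines into 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts for each word.

    The input must already be sorted by key so that equal keys are adjacent;
    the framework sorts mapper output before handing it to the reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: the first argument selects the role.
    step = mapper if sys.argv[1] == "map" else reducer
    for record in step(sys.stdin):
        print(record)
```

To try it without a cluster, a shell pipeline such as `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce` stands in for the sort that Hadoop performs between the two phases (`wordcount.py` and `input.txt` are hypothetical names).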
  • MrJob (Python): ON HADOOP
    Streaming + local / EMR / your Hadoop
  • MrJob (Python): ON HADOOP
    Multi-step jobs
  • Pig: ON HADOOP
    Less code
    Expressive code
  • Pig: BRIEF, EXPRESSIVE
    (thanks: Twitter Hadoop World presentation)
  • The Same Script, In: FOR SERIOUS
  • Pig: ON HADOOP
    Less code
    Expressive code
    Compiles to MR
    Insulates from API
    Popular
    (LinkedIn, Twitter, Salesforce, Yahoo, Stanford
  • Pig: ON HADOOP
    Works with Jython
    Not Python
    Stream, no types
    UDF read stdin
    UDF deserialize, no types
    Serialize for Pig
    Write to stdout
    Exceptions
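The steps listed on the slide above (read stdin, deserialize without types, serialize for Pig, write to stdout, handle exceptions) can be sketched as a small streaming script. This is an illustration under assumptions, not the talk's code: it assumes each tuple arrives as one tab-delimited line, and the `score` function with its `user`, `clicks`, and `views` fields is hypothetical.

```python
import sys


def deserialize(line):
    """Split one tab-delimited line into fields; every field is a string."""
    return line.rstrip("\n").split("\t")


def score(user, clicks, views):
    """Hypothetical UDF body: click-through rate for one user."""
    return user, clicks / views


def serialize(fields):
    """Join fields back into one tab-delimited line for Pig to read."""
    return "\t".join(str(field) for field in fields)


def process(lines, err=sys.stderr):
    """Apply score() to each input line, skipping lines that fail."""
    for line_number, line in enumerate(lines, start=1):
        try:
            user, clicks, views = deserialize(line)
            # No type information survives the stream, so cast explicitly.
            result = score(user, int(clicks), int(views))
        except (ValueError, ZeroDivisionError) as exc:
            # Report the bad record on stderr and keep the job running.
            err.write(f"skipping line {line_number}: {exc}\n")
            continue
        yield serialize(result)


if __name__ == "__main__":
    # In-memory sample; a real streaming script would pass sys.stdin instead.
    sample = ["alice\t3\t12\n", "bob\tx\t5\n", "carol\t1\t0\n"]
    for record in process(sample):
        print(record)
```

Catching per-record exceptions is a design choice: one malformed line is logged and skipped rather than failing the whole task.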
  • Pig + Python: ON HADOOP
  • Hadoop + Python: NOT ACTUALLY MAGIC
    Hadoop won’t magically parallelize your algorithm
  • Hadoop + Python: EFFICIENCY
    Don’t stream Java-based languages
      • Jython
      • Pig + Jython
    Streaming has ~30% overhead
      • Python
      • MrJob
      • Pig + Python
  • Hadoop + Python: EXCITED?
    Well... 90-95% of time isn’t spent on algos
  • Hadoop + Python: HARD STUFF: SETUP
    Get Hadoop running
    Software where it needs to be
    Processes communicating
    Data available
  • Hadoop + Python: HARD STUFF: DEVELOP
    Learn
    Project structure, modularity
    Dev environment like Production
  • Hadoop + Python: HARD STUFF: VALIDATE
    Syntax check
    Packages available
    Data readable
    Data writable
    Without long waits for failure
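The validation checks on the slide above can be run before a job is submitted, so that failures surface in seconds rather than after a long wait on the cluster. The following is a rough sketch using only the Python standard library; the function names are hypothetical, and the path checks assume local files (data in HDFS or S3 would need the equivalent check through that storage system's own client).

```python
import importlib.util
import os


def syntax_error(source, filename="<script>"):
    """Return None if the source compiles, else a short error message."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


def missing_packages(names):
    """Return the top-level packages that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def unreadable_paths(paths):
    """Return the input paths that are missing or not readable."""
    return [path for path in paths if not os.access(path, os.R_OK)]


def unwritable_dirs(dirs):
    """Return the output directories that are missing or not writable."""
    return [path for path in dirs if not os.access(path, os.W_OK)]


if __name__ == "__main__":
    print(syntax_error("def f(:\n    pass\n"))
    print(missing_packages(["json", "no_such_package_xyz"]))
```

These checks only cover the machine they run on; they are useful when the development environment matches production, as the DEVELOP slide recommends.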
  • Hadoop + Python: HARD STUFF: DEBUG
    Distributed execution is hard to debug
  • Hadoop + Python: HARD STUFF: TEST
    Data processing is hard to test
    But critical
  • Hadoop + Python: HARD STUFF: DEPLOY
    Environments identical
    Code correctly deployed
    Configuration changes
    Non-disruptive
  • Hadoop + Python: HARD STUFF: HISTORY
    Stats about prior runs
    What code was run?
    What’s changed?
  • Hadoop + Python: HARD STUFF: LOGS
    Distributed logs hard to make sense of
    Hadoop logs hard to understand
    Ephemeral clusters lose logs
  • Hadoop + Python: HARD STUFF: MORTAR’S APPROACH
    Setup: PaaS, pip installation, connectors
    Develop: learning, structure, instant dev env
    Validate: fast validate
    Debug: printf, more coming
    Test: Rails-like test suites
    Deploy: one-button deploy
  • K Young @kky