Hadoop, Pig, and Python (PyData NYC 2012)
Mortar CEO K Young's talk on using Python with Hadoop and Pig (vs. Jython and MapReduce), including NumPy, SciPy, and NLTK.

Published in: Technology
  1. Hadoop, Pig, and Python. PyData NYC 2012
  2. Overview of this session: Why Python on Hadoop?; fast Hadoop overview; Jython; Python; MrJob; Pig (how they work, challenges, efficiency, how to start)
  3. Too much data for one machine: data doubles every 18 months
  4. ETL / Munging: cleanse; format; simple calculations
  5. Social Graph
  6. Predict
  7. Detect
  8. Genetics
  9. Hadoop, rapid overview: MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
  10. Hadoop, rapid overview
  11. Hadoop, rapid overview: Hadoop implements MapReduce in Java (Doug Cutting); incubated at Yahoo; indexing, spam detection, more
  12. Hadoop, problems: difficult; not much Python; batch only (...or it was)
  13. Hadoop, future: YARN; MapReduce optional; generic management + distributed apps; Impala
  14. Hadoop and Python
  15. Jython on Hadoop (map)
  16. Jython on Hadoop (reduce; 1st half)
  17. Jython on Hadoop (reduce; 2nd half)
  18. Jython on Hadoop
  19. Python on Hadoop: Streaming (works with any language, not just
  20. MrJob (Python) on Hadoop: Streaming + local / EMR / your Hadoop
  21. MrJob (Python) on Hadoop: multi-step jobs
  22. Pig on Hadoop: less code; expressive code
  23. Pig, brief and expressive (thanks: Twitter Hadoop World presentation)
  24. The Same Script, In: for serious
  25. Pig on Hadoop: less code; expressive code; compiles to MR; insulates from API; popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford
  26. Pig on Hadoop: works with Jython; not Python; stream, no types; UDF read stdin; UDF deserialize, no types; serialize for Pig; write to stdout; exceptions
  27. Pig + Python on Hadoop
  28. Hadoop + Python, not actually magic: Hadoop won’t magically parallelize your algorithm
  29. Hadoop + Python, efficiency: don’t stream Java-based languages (Jython, Pig + Jython); streaming has ~30% overhead (Python, MrJob, Pig + Python)
  30. Hadoop + Python, excited? Well... 90-95% of time isn’t spent on algos
  31. Hadoop + Python, hard stuff (setup): get Hadoop running; software where it needs to be; processes communicating; data available
  32. Hadoop + Python, hard stuff (develop): learn; project structure, modularity; dev environment like production
  33. Hadoop + Python, hard stuff (validate): syntax check; packages available; data readable; data writable; without long waits for failure
  34. Hadoop + Python, hard stuff (debug): distributed execution is hard to debug
  35. Hadoop + Python, hard stuff (test): data processing is hard to test, but critical
  36. Hadoop + Python, hard stuff (deploy): environments identical; code correctly deployed; configuration changes; non-disruptive
  37. Hadoop + Python, hard stuff (history): stats about prior runs; what code was run?; what’s changed?
  38. Hadoop + Python, hard stuff (logs): distributed logs hard to make sense of; Hadoop logs hard to understand; ephemeral clusters lose logs
  39. Hadoop + Python, hard stuff (Mortar’s approach): Setup: PaaS, pip installation, connectors. Develop: learning, structure, instant dev env. Validate: fast validate. Debug: printf, more coming. Test: Rails-like test suites. Deploy: one-button deploy
  40. K Young @kky
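
The slides on the MapReduce programming model and on Python via Streaming did not survive with their code, so here is a minimal sketch of what a streaming-style word count might look like. It is not the speaker's code. It assumes the usual Streaming convention of one tab-separated key/value record per line, and the names `mapper`, `reducer`, and `run_locally` are hypothetical, chosen only for illustration. The sort step stands in for the shuffle that Hadoop performs between map and reduce.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; records must already be sorted by key."""
    parsed = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort (standing in for the shuffle) -> reduce on one machine."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    # Hypothetical sample input; on a cluster the records would arrive on stdin.
    sample = ["Pig pig hadoop", "hadoop Python"]
    for record in run_locally(sample):
        print(record)
```

On a real cluster the mapper and reducer would typically be separate scripts reading `sys.stdin` and printing to stdout. Running the same logic locally first is one way to catch errors without waiting on a distributed job.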