Hadoop, Pig, and Python (PyData NYC 2012)

Mortar CEO K Young's talk on using Python with Hadoop and Pig (vs. Jython and MapReduce), including NumPy, SciPy, and NLTK.

  1. Hadoop, Pig, and Python: PyData NYC 2012
  2. Overview OF THIS SESSION: Why Python on Hadoop?; Fast Hadoop overview; Jython; Python; MrJob; Pig (How they work, challenges, efficiency, how to start)
  3. Too much data FOR ONE MACHINE: Data doubles every 18 mo
  4. ETL / Munging: Cleanse; Format; Simple calculations
  5. Social Graph
  6. Predict
  7. Detect
  8. Genetics
  9. Hadoop RAPID OVERVIEW: MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
  10. Hadoop RAPID OVERVIEW
  11. Hadoop RAPID OVERVIEW: Hadoop implements MapReduce (Java) (Doug Cutting); Incubated at Yahoo; Indexing, Spam detection, more
  12. Hadoop PROBLEMS: Difficult; Not much Python; Batch only (...or it was)
  13. Hadoop FUTURE: Yarn; MapReduce optional; Generic management + distributed apps; Impala
  14. Hadoop AND PYTHON
  15. Jython ON HADOOP (MAP)
  16. Jython ON HADOOP (REDUCE; 1ST HALF)
  17. Jython ON HADOOP (REDUCE; 2ND HALF)
  18. Jython ON HADOOP
  19. Python ON HADOOP: Streaming (Works with any language, not just
  20. MrJob (Python) ON HADOOP: Streaming + local / EMR / your Hadoop
  21. MrJob (Python) ON HADOOP: Multi-step jobs
  22. Pig ON HADOOP: Less code; Expressive code
  23. Pig BRIEF, EXPRESSIVE (thanks: twitter hadoop world presentation)
  24. The Same Script, In FOR SERIOUS
  25. Pig ON HADOOP: Less code; Expressive code; Compiles to MR; Insulates from API; Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford
  26. Pig ON HADOOP: Works with Jython; Not Python; Stream, no types; UDF read stdin; UDF deserialize, no types; Serialize for Pig; Write to stdout; Exceptions
  27. Pig + Python ON HADOOP
  28. Hadoop + Python NOT ACTUALLY MAGIC: Hadoop won’t magically parallelize your algorithm
  29. Hadoop + Python EFFICIENCY: Don’t stream Java-based languages (Jython, Pig + Jython); Streaming has ~30% overhead (Python, MrJob, Pig + Python)
  30. Hadoop + Python EXCITED? Well... 90-95% of time isn’t spent on algos
  31. Hadoop + Python HARD STUFF: SETUP. Get Hadoop running; Software where it needs to be; Processes communicating; Data available
  32. Hadoop + Python HARD STUFF: DEVELOP. Learn; Project structure, modularity; Dev environment like Production
  33. Hadoop + Python HARD STUFF: VALIDATE. Syntax check; Packages available; Data readable; Data writable; Without long waits for failure
  34. Hadoop + Python HARD STUFF: DEBUG. Distributed execution is hard to debug
  35. Hadoop + Python HARD STUFF: TEST. Data processing is hard to test; But critical
  36. Hadoop + Python HARD STUFF: DEPLOY. Environments identical; Code correctly deployed; Configuration changes; Non-disruptive
  37. Hadoop + Python HARD STUFF: HISTORY. Stats about prior runs; What code was run?; What’s changed?
  38. Hadoop + Python HARD STUFF: LOGS. Distributed logs hard to make sense of; Hadoop logs hard to understand; Ephemeral clusters lose logs
  39. Hadoop + Python HARD STUFF: MORTAR’S APPROACH. Setup: PaaS, pip installation, connectors; Develop: learning, structure, instant dev env; Validate: fast validate; Debug: printf, more coming; Test: Rails-like test suites; Deploy: one-button deploy
  40. K Young @kky
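
The streaming slides (19 and 26) describe scripts that read untyped text records from stdin and write records to stdout. The code from the talk's slides did not survive in this transcript, so what follows is only a minimal word-count sketch of that model, not the speaker's example. The function names `map_lines`, `reduce_lines`, and `run_locally` are hypothetical, and the sketch assumes tab-separated key/value lines with mapper output sorted by key before it reaches the reducer, as Hadoop Streaming does by default.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one untyped 'word<TAB>1' text record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key; input must be sorted by key."""
    # Records arrive as plain text, so the script deserializes them itself.
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in a single process (no Hadoop needed)."""
    return list(reduce_lines(sorted(map_lines(lines))))


counts = run_locally(["Pig and Python", "python on Hadoop"])
```

On a real cluster each stage would run as a separate process reading `sys.stdin` and printing to stdout. Running both stages locally like this is one way to check the logic before waiting on a distributed job.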
