Hadoop, Pig, and Python (PyData NYC 2012)


Mortar CEO K Young's talk on using Python with Hadoop and Pig (vs. Jython and MapReduce), including NumPy, SciPy, and NLTK.



Transcript

  • 1. Hadoop, Pig, and Python (PyData NYC 2012)
  • 2. Overview of this session: Why Python on Hadoop?; fast Hadoop overview; Jython; Python; MrJob; Pig (how they work, challenges, efficiency, how to start)
  • 3. Too much data for one machine: data doubles every 18 months
  • 4. ETL / Munging: cleanse; format; simple calculations
  • 5. Social Graph
  • 6. Predict
  • 7. Detect
  • 8. Genetics
  • 9. Hadoop (rapid overview): MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
  • 10. Hadoop (rapid overview)
  • 11. Hadoop (rapid overview): Hadoop implements MapReduce in Java (Doug Cutting); incubated at Yahoo; indexing, spam detection, more
  • 12. Hadoop (problems): difficult; not much Python; batch only (...or it was)
  • 13. Hadoop (future): YARN; MapReduce optional; generic management + distributed apps; Impala
  • 14. Hadoop and Python
  • 15. Jython on Hadoop (map)
  • 16. Jython on Hadoop (reduce; 1st half)
  • 17. Jython on Hadoop (reduce; 2nd half)
  • 18. Jython on Hadoop
  • 19. Python on Hadoop: Streaming (works with any language, not just
  • 20. MrJob (Python) on Hadoop: Streaming + local / EMR / your Hadoop
  • 21. MrJob (Python) on Hadoop: multi-step jobs
  • 22. Pig on Hadoop: less code; expressive code
  • 23. Pig, brief and expressive (thanks: Twitter Hadoop World presentation)
  • 24. The Same Script, In (for serious)
  • 25. Pig on Hadoop: less code; expressive code; compiles to MR; insulates from API; popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford
  • 26. Pig on Hadoop: works with Jython; not Python; stream, no types; UDF read stdin; UDF deserialize, no types; serialize for Pig; write to stdout; exceptions
  • 27. Pig + Python on Hadoop
  • 28. Hadoop + Python, not actually magic: Hadoop won’t magically parallelize your algorithm
  • 29. Hadoop + Python, efficiency: don’t stream Java-based languages (Jython; Pig + Jython); streaming has ~30% overhead (Python; MrJob; Pig + Python)
  • 30. Hadoop + Python, excited? Well... 90-95% of time isn’t spent on algos
  • 31. Hadoop + Python, hard stuff (setup): get Hadoop running; software where it needs to be; processes communicating; data available
  • 32. Hadoop + Python, hard stuff (develop): learn; project structure, modularity; dev environment like production
  • 33. Hadoop + Python, hard stuff (validate): syntax check; packages available; data readable; data writable; without long waits for failure
  • 34. Hadoop + Python, hard stuff (debug): distributed execution is hard to debug
  • 35. Hadoop + Python, hard stuff (test): data processing is hard to test, but critical
  • 36. Hadoop + Python, hard stuff (deploy): environments identical; code correctly deployed; configuration changes; non-disruptive
  • 37. Hadoop + Python, hard stuff (history): stats about prior runs; what code was run?; what’s changed?
  • 38. Hadoop + Python, hard stuff (logs): distributed logs hard to make sense of; Hadoop logs hard to understand; ephemeral clusters lose logs
  • 39. Hadoop + Python, hard stuff (Mortar’s approach): Setup: PaaS, pip installation, connectors; Develop: learning, structure, instant dev env; Validate: fast validate; Debug: printf, more coming; Test: Rails-like test suites; Deploy: one-button deploy
  • 40. K Young @kky
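
Slides 9, 19, and 26 mention the MapReduce model and the streaming approach (read records from stdin, write records to stdout) without showing code. The sketch below is not taken from the talk; it is a minimal word-count illustration of that pattern, assuming tab-separated key/value records. The function names (map_lines, reduce_lines, run_locally, stream) are hypothetical, and run_locally only imitates the sort that Hadoop performs between the map and reduce stages.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' record for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum the counts for each key; input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def run_locally(lines):
    """Imitate map -> sort -> reduce in a single process (no Hadoop involved)."""
    return list(reduce_lines(sorted(map_lines(lines))))


def stream(step, infile=sys.stdin, outfile=sys.stdout):
    """Run one stage the way a streaming job would: stdin in, stdout out."""
    for record in step(infile):
        outfile.write(record + "\n")


# Local usage example on two tiny input lines.
counts = run_locally(["the cat", "The dog"])
```

In a real streaming job the mapper and reducer would typically live in separate scripts, each calling something like stream(map_lines) or stream(reduce_lines), with the framework handling the sort in between. Because every record crosses a process boundary as text, types are lost and must be re-parsed, which is the "no types" caveat on slide 26.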