Hadoop, Pig, and Python (PyData NYC 2012)

Mortar CEO K Young's talk on using Python with Hadoop and Pig (vs. Jython and MapReduce), including NumPy, SciPy, and NLTK.

Published in: Technology

Transcript

  • 1. Hadoop, Pig, and Python. PyData NYC 2012
  • 2. Overview of this session: Why Python on Hadoop?; Fast Hadoop overview; Jython; Python; MrJob; Pig (how they work, challenges, efficiency, how to start)
  • 3. Too much data for one machine: data doubles every 18 months
  • 4. ETL / Munging: cleanse; format; simple calculations
  • 5. Social Graph
  • 6. Predict
  • 7. Detect
  • 8. Genetics
  • 9. Hadoop, rapid overview: MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
  • 10. Hadoop, rapid overview
  • 11. Hadoop, rapid overview: Hadoop implements MapReduce (Java) (Doug Cutting); incubated at Yahoo; indexing, spam detection, more
  • 12. Hadoop problems: difficult; not much Python; batch only (...or it was)
  • 13. Hadoop future: YARN; MapReduce optional; generic management + distributed apps; Impala
  • 14. Hadoop and Python
  • 15. Jython on Hadoop (map)
  • 16. Jython on Hadoop (reduce; 1st half)
  • 17. Jython on Hadoop (reduce; 2nd half)
  • 18. Jython on Hadoop
  • 19. Python on Hadoop: Streaming (Works with any language, not just
  • 20. MrJob (Python) on Hadoop: Streaming + local / EMR / your Hadoop
  • 21. MrJob (Python) on Hadoop: multi-step jobs
  • 22. Pig on Hadoop: less code; expressive code
  • 23. Pig: brief, expressive (thanks: Twitter Hadoop World presentation)
  • 24. The Same Script, In: for serious
  • 25. Pig on Hadoop: less code; expressive code; compiles to MR; insulates from API; popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford
  • 26. Pig on Hadoop: works with Jython; not Python; stream, no types; UDF read stdin; UDF deserialize, no types; serialize for Pig; write to stdout; exceptions
  • 27. Pig + Python on Hadoop
  • 28. Hadoop + Python, not actually magic: Hadoop won’t magically parallelize your algorithm
  • 29. Hadoop + Python, efficiency: don’t stream Java-based languages (Jython; Pig + Jython); streaming has ~30% overhead (Python; MrJob; Pig + Python)
  • 30. Hadoop + Python, excited? Well... 90-95% of time isn’t spent on algos
  • 31. Hadoop + Python, hard stuff: setup. Get Hadoop running; software where it needs to be; processes communicating; data available
  • 32. Hadoop + Python, hard stuff: develop. Learn; project structure, modularity; dev environment like production
  • 33. Hadoop + Python, hard stuff: validate. Syntax check; packages available; data readable; data writable; without long waits for failure
  • 34. Hadoop + Python, hard stuff: debug. Distributed execution is hard to debug
  • 35. Hadoop + Python, hard stuff: test. Data processing is hard to test, but critical
  • 36. Hadoop + Python, hard stuff: deploy. Environments identical; code correctly deployed; configuration changes; non-disruptive
  • 37. Hadoop + Python, hard stuff: history. Stats about prior runs; what code was run?; what’s changed?
  • 38. Hadoop + Python, hard stuff: logs. Distributed logs hard to make sense of; Hadoop logs hard to understand; ephemeral clusters lose logs
  • 39. Hadoop + Python, hard stuff: Mortar’s approach. Setup: PaaS, pip installation, connectors; Develop: learning, structure, instant dev env; Validate: fast validate; Debug: printf, more coming; Test: Rails-like test suites; Deploy: one-button deploy
  • 40. K Young @kky
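
The code shown on the streaming slides (19 and 26) did not survive in this transcript, so the following is only a minimal sketch of the general pattern they describe: a script reads untyped text records from stdin and writes tab-separated records to stdout. The word-count task, the function names (`mapper`, `reducer`, `run_locally`), and the "map"/"reduce" command-line convention are assumptions chosen for illustration, not taken from the talk.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts for each word; input must already be sorted by key."""
    # Streaming records arrive as plain text, so split and cast by hand.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process, for quick local checks."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical convention: run as `script.py map` or `script.py reduce`.
if __name__ == "__main__" and len(sys.argv) == 2:
    stage = {"map": mapper, "reduce": reducer}[sys.argv[1]]
    for record in stage(sys.stdin):
        print(record)
```

Calling `run_locally(["the pig the", "Pig python"])` exercises both stages without a cluster. On a real cluster, Hadoop Streaming's `-mapper` and `-reducer` options would point at scripts of this shape, and the framework performs the sort between the two stages.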
