MongoDB + Pig on Hadoop (MongoSV 2012)


Slides from Mortar co-founder Jeremy Karn's presentation at MongoSV 2012. Learn to process Mongo data with Hadoop—specifically with Apache Pig.

Jeremy's presentation covered the steps needed to read JSON from Mongo into Pig, process it in parallel on Hadoop with sophisticated functions, and write the results back to Mongo. The talk demonstrated these concepts with Mortar, which has contributed to the Mongo Hadoop connector, extending it to work with Pig.

Published in: Technology

  1. MongoDB + Pig: Jeremy Karn, co-founder, Mortar
  2. Overview of this session: Intro to Hadoop; Intro to Pig; Why MongoDB + Pig?; Demo: load Pig; Demo: processing data with Pig; Demo: store data from Pig to MongoDB
  3. Hadoop, rapid overview: MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
  4. Hadoop, rapid overview
  5. Hadoop, rapid overview: Hadoop implements MapReduce (Java) (Doug Cutting); Incubated at Yahoo; Indexing, spam detection, more
  6. Hadoop, strengths: Scalable; Open source; Lots of momentum; Very broadly applicable
  7. Social Graph
  8. Predict
  9. Detect
  10. Genetics
  11. Hadoop, problems: Difficult; Batch only (...or it was)
  12. Hadoop, future: YARN; MapReduce optional; Generic management + distributed apps; Impala
  13. Alternatives to Hadoop, MongoDB native MapReduce: Write MapReduce in JavaScript (JavaScript is not fast; has limited data types; hard to use complex analytic libs); Adds load to data store
  14. Alternatives to Hadoop, MongoDB native MapReduce: Hadoop has libs for machine learning and ETL, and can access any JVM analytic libs; And many organizations already use Hadoop
  15. Alternatives to Hadoop, MongoDB aggregation framework: Great when doing SQL-style aggregation; do not require external data libs; users will learn framework
  16. Alternatives to Hadoop, MongoDB aggregation framework: But you may want Hadoop when doing sophisticated aggregation; require external data libs; users unwilling to learn framework; need to transfer workload off datastore
  17. Pig on Hadoop: Less code; Expressive code
  18. Pig: Brief, expressive, like procedural SQL (thanks: Twitter Hadoop World presentation)
  19. The Same Script, in MapReduce (for serious)
  20. Pig on Hadoop: Less code; Expressive code; Compiles to MR; Insulates from API; Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford)
  21. MongoDB + Pig, motivations: Data storage and data processing are often separate concerns; Hadoop is built for scalable processing of large datasets
  22. MongoDB, Pig, similar stance: Poly-structured data. MongoDB: stores data, regardless of structure. Pig: reads data, regardless of structure (got its name because pigs are omnivorous)
  23. MongoDB, Pig, JSON-Pig data type mapping (JSON → Pig): string → chararray; integer → int; boolean → boolean; double → double; array → bag; object → map/tuple; null → null
  24. MongoDB, Pig, MongoDB-Pig data type mapping (MongoDB → Pig): date → datetime; object id → chararray; binary data → bytearray; regexp → chararray; code → chararray
  25. Mortar, fast intro: Open-source code-based dev framework for data, built on Hadoop and Pig; Inspired by Rails; Self-contained, organized, executable projects
  26. > gem install mortar
  27. > mortar new my_project
  28. Mortar, fast intro: Our service hosts and executes mortar projects
  29. > mortar jobs:run your_pigscript --clustersize 5
  30. Mortar, fast intro: Browser-only interface, great for demonstrating Hadoop
  31. MongoDB, Pig, loading data: One requirement: must specify top-level fields to load from the MongoDB collection. Optional: specify a subset of embedded fields; data type for any/all fields
  32. MongoDB, Pig, loading data (Enron data): {"body": "the ... person...", "subFolder": "notes_inbox", "mailbox": "bass-e", "filename": "450.", "headers": {"From": "", "To": "", "Subject": "Subject", "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"}}
  33. MongoDB, Pig, script demo
  34. MongoDB, Pig, store statement: The MongoStorage function takes an optional list of arguments of two types: (a) a single set of keys to base updating on, with three options: none, update, or multi; (b) multiple indexes to ensure, in the same format as db.col.ensureIndex().
  35. Pig, ILLUSTRATE: Auto-select dataset; Exercise every execution path; Step-by-step execution
  36. Pig, why ILLUSTRATE: Write correct code quickly; Understand others' code; Test every execution path, every step
  37. Pig, user-defined functions (UDF): Pig is like procedural SQL; UDFs for rich data manipulation; UDFs: Java-based language; We made Pig work with CPython (NumPy, etc.)
  38. MongoDB + Pig, without Mortar: Get the mongo-hadoop connector:
  39. MongoDB + Pig, summary: Hadoop and friends are maturing; MongoDB and Pig are philosophically aligned; Reading and writing to Pig is straightforward; Once in Pig (Hadoop): massive batch calcs / analytics possible; work is offloaded; external libraries available
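
The type mappings on slides 23 and 24 can be sketched as a small lookup in Python. This is only an illustration of the two tables, not the mongo-hadoop connector's code: the function names `pig_type_for` and `describe_top_level` are hypothetical, the sketch assumes documents have already been decoded into plain Python values, and it picks `map` where the slide lists `map/tuple` for embedded objects.

```python
from datetime import datetime


def pig_type_for(value):
    """Return the Pig type name that slides 23-24 pair with a decoded value.

    Hypothetical helper for illustration; not part of any connector API.
    """
    if value is None:
        return "null"
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "chararray"
    if isinstance(value, datetime):
        return "datetime"
    if isinstance(value, (bytes, bytearray)):
        return "bytearray"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, dict):
        # The slide lists "map/tuple" for objects; this sketch assumes map.
        return "map"
    raise TypeError(f"no mapping listed for {type(value).__name__}")


def describe_top_level(document):
    """Map each top-level field of a document to its Pig type name."""
    return {field: pig_type_for(value) for field, value in document.items()}


# A trimmed version of the Enron document from slide 32.
enron_doc = {
    "mailbox": "bass-e",
    "filename": "450.",
    "headers": {"From": "", "To": ""},
}
schema = describe_top_level(enron_doc)
```

Here `schema` pairs `mailbox` and `filename` with `chararray` and `headers` with `map`, which mirrors the kind of top-level field list slide 31 says must be given when loading a collection.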