MongoDB + Pig on Hadoop (MongoSV 2012)

Slides from Mortar co-founder Jeremy Karn's presentation at MongoSV 2012. Learn to process Mongo data with Hadoop—specifically with Apache Pig.

Jeremy's presentation covered the steps needed to read JSON from Mongo into Pig, process it in parallel on Hadoop with sophisticated functions, and write the results back to Mongo. The talk demonstrated these concepts with Mortar, which has contributed to the Mongo Hadoop connector, extending it to work with Pig.

Published in: Technology

Transcript of "MongoDB + Pig on Hadoop (MongoSV 2012)"

  1. MongoDB + Pig: Jeremy Karn, co-founder, Mortar
  2. Overview: OF THIS SESSION. Intro to Hadoop; Intro to Pig; Why MongoDB + Pig?; Demo: load Pig; Demo: processing data with Pig; Demo: store data from Pig to MongoDB
  3. Hadoop: RAPID OVERVIEW. MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
  4. Hadoop: RAPID OVERVIEW
  5. Hadoop: RAPID OVERVIEW. Hadoop implements MapReduce (Java) (Doug Cutting); incubated at Yahoo; indexing, spam detection, more
  6. Hadoop: STRENGTHS. Scalable; open source; lots of momentum; very broadly applicable
  7. Social Graph
  8. Predict
  9. Detect
  10. Genetics
  11. Hadoop: PROBLEMS. Difficult; batch only (...or it was)
  12. Hadoop: FUTURE. YARN; MapReduce optional; generic management + distributed apps; Impala
  13. Alternatives to Hadoop: MONGODB NATIVE MAPREDUCE. Write MapReduce in JavaScript: JavaScript is not fast; has limited data types; hard to use complex analytic libs. Adds load to data store
  14. Alternatives to Hadoop: MONGODB NATIVE MAPREDUCE. Hadoop has libs for: machine learning; ETL; can access any JVM analytic libs. And many organizations already use Hadoop
  15. Alternatives to Hadoop: MONGODB AGGREGATION FRAMEWORK. Great when: doing SQL-style aggregation; do not require external data libs; users will learn framework
  16. Alternatives to Hadoop: MONGODB AGGREGATION FRAMEWORK. But you may want Hadoop when: doing sophisticated aggregation; require external data libs; users unwilling to learn framework; need to transfer workload off datastore
  17. Pig: ON HADOOP. Less code; expressive code
  18. Pig: BRIEF, EXPRESSIVE. Like procedural SQL (thanks: Twitter Hadoop World presentation)
  19. The Same Script, In MapReduce: FOR SERIOUS
  20. Pig: ON HADOOP. Less code; expressive code; compiles to MR; insulates from API; popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford)
  21. MongoDB + Pig: MOTIVATIONS. Data storage and data processing are often separate concerns; Hadoop is built for scalable processing of large datasets
  22. MongoDB, Pig: SIMILAR STANCE. Poly-structured data. MongoDB: stores data, regardless of structure; Pig: reads data, regardless of structure (got its name because pigs are omnivorous)
  23. MongoDB, Pig: JSON-PIG DATA TYPE MAPPING (JSON -> Pig). string -> chararray; integer -> int; boolean -> boolean; double -> double; array -> bag; object -> map/tuple; null -> null
  24. MongoDB, Pig: MONGODB-PIG DATA TYPE MAPPING (MongoDB -> Pig). date -> datetime; object id -> chararray; binary data -> bytearray; regexp -> chararray; code -> chararray
  25. Mortar: FAST INTRO. Open-source, code-based dev framework for data, built on Hadoop and Pig; inspired by Rails; self-contained, organized, executable projects
  26. > gem install mortar
  27. > mortar new my_project
  28. Mortar: FAST INTRO. Our service hosts and executes mortar projects
  29. > mortar jobs:run your_pigscript --clustersize 5
  30. Mortar: FAST INTRO. Browser-only interface, great for demonstrating Hadoop
  31. MongoDB, Pig: LOADING DATA. One requirement: must specify top-level fields to load from the MongoDB collection. Optional: specify a subset of embedded fields; data type for any/all fields
  32. MongoDB, Pig: LOADING DATA - ENRON DATA. { "body": "the ... person...", "subFolder": "notes_inbox", "mailbox": "bass-e", "filename": "450.", "headers": { "From": "michael.simmons@enron.com", "To": "tim_belden@enron.com", "Subject": "Subject", "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)" } }
  33. MongoDB, Pig: SCRIPT DEMO
  34. MongoDB, Pig: STORE STATEMENT. The MongoStorage function takes an optional list of arguments of two types: (a) a single set of keys to base updating on, which has three options: None, update, or multi; (b) multiple indexes to ensure, in the same format as db.col.ensureIndex()
  35. Pig: ILLUSTRATE. Auto-select dataset; exercise every execution path; step-by-step execution
  36. Pig: WHY ILLUSTRATE. Write correct code quickly; understand others' code; test every execution path, every step
  37. Pig: USER-DEFINED FUNCTIONS (UDF). Pig is like procedural SQL; UDFs for rich data manipulation; UDFs: Java-based language; we made Pig work with CPython (NumPy, etc.)
  38. MongoDB + Pig: WITHOUT MORTAR. Get the mongo-hadoop connector: http://github.com/mongodb/mongo-hadoop
  39. MongoDB + Pig: SUMMARY. Hadoop and friends are maturing; MongoDB and Pig are philosophically aligned; reading and writing to Pig is straightforward. Once in Pig (Hadoop): massive batch calcs / analytics possible; work is offloaded; external libraries available
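
The load, process, and store flow that the slides describe can be sketched in plain Python. This is only an illustration of the idea, not the Pig script from the demo and not the mongo-hadoop connector's API: the helper names `pig_type`, `load`, and `count_by_sender` are hypothetical, the type table is copied from the JSON-to-Pig mapping slide, and only the first sample document follows the Enron example in the transcript (the other two are invented).

```python
from collections import Counter

# JSON -> Pig type names as listed on the JSON-Pig mapping slide.
# The slide gives "map/tuple" for objects; "map" is used here for simplicity.
JSON_TO_PIG = {
    str: "chararray",
    int: "int",
    bool: "boolean",
    float: "double",
    list: "bag",
    dict: "map",
    type(None): "null",
}


def pig_type(value):
    """Return the Pig type name for a decoded JSON value (hypothetical helper)."""
    # type() is used instead of isinstance() because bool is a subclass of int.
    return JSON_TO_PIG[type(value)]


def load(docs, fields):
    """Keep only the named top-level fields of each document."""
    return [{name: doc.get(name) for name in fields} for doc in docs]


def count_by_sender(rows):
    """Group rows by headers.From and count them (a GROUP BY plus COUNT)."""
    counts = Counter(row["headers"]["From"] for row in rows)
    return [{"sender": s, "emails": n} for s, n in sorted(counts.items())]


# The first document follows the Enron sample in the transcript (elisions kept);
# the second and third are invented for illustration.
SAMPLE_DOCS = [
    {
        "body": "the ... person...",
        "subFolder": "notes_inbox",
        "mailbox": "bass-e",
        "filename": "450.",
        "headers": {
            "From": "michael.simmons@enron.com",
            "To": "tim_belden@enron.com",
            "Subject": "Subject",
            "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
        },
    },
    {
        "filename": "451.",
        "headers": {"From": "michael.simmons@enron.com", "To": "tim_belden@enron.com"},
    },
    {
        "filename": "452.",
        "headers": {"From": "tim_belden@enron.com", "To": "michael.simmons@enron.com"},
    },
]

# Load: name the top-level fields to read, as the loading slide requires.
rows = load(SAMPLE_DOCS, ["filename", "headers"])
# Inspect: which Pig type each loaded field would map to.
schema = {name: pig_type(value) for name, value in rows[0].items()}
# Process: one output record per sender.
summary = count_by_sender(rows)
```

In a real job the `summary` records would be what the store step writes back to MongoDB, one document per sender, while the heavy grouping work runs on Hadoop rather than on the data store.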