MongoDB + Pig on Hadoop (MongoSV 2012)

Slides from Mortar co-founder Jeremy Karn's presentation at MongoSV 2012. Learn to process Mongo data with Hadoop—specifically with Apache Pig.

Jeremy's presentation covered the steps needed to read JSON from Mongo into Pig, process it in parallel on Hadoop with sophisticated functions, and write the results back to Mongo. The talk demonstrated these concepts with Mortar, which has contributed to the Mongo Hadoop connector, extending it to work with Pig.

MongoDB + Pig on Hadoop (MongoSV 2012): Presentation Transcript

  • MongoDB + Pig
    Jeremy Karn - co-founder, Mortar
  • Overview of This Session
    Intro to Hadoop
    Intro to Pig
    Why MongoDB + Pig?
    Demo: load Pig
    Demo: processing data with Pig
    Demo: store data from Pig to MongoDB
  • Hadoop: Rapid Overview
    MapReduce programming model from Google (Jeff Dean and Sanjay Ghemawat)
  • Hadoop: Rapid Overview
  • Hadoop: Rapid Overview
    Hadoop implements MapReduce (Java) (Doug Cutting)
    Incubated at Yahoo
    Indexing, spam detection, more
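
The MapReduce programming model named on these slides can be sketched in a few lines of plain Python. This is only an in-memory illustration, not Hadoop code: the sample records and the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical input: (mailbox, subfolder) pairs standing in for email records.
records = [
    ("bass-e", "notes_inbox"),
    ("bass-e", "sent"),
    ("belden-t", "inbox"),
]


def map_phase(record):
    """Emit one (key, 1) pair per record, keyed by mailbox."""
    mailbox, _subfolder = record
    yield mailbox, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = run_job(records)
```

In a real cluster the map and reduce calls would run on many machines at once; the sketch only shows how the three stages hand data to each other.
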
  • Hadoop: Strengths
    Scalable
    Open source
    Lots of momentum
    Very broadly applicable
  • Social Graph
  • Predict
  • Detect
  • Genetics
  • Hadoop: Problems
    Difficult
    Batch only (...or it was)
  • Hadoop: Future
    YARN
    MapReduce optional
    Generic management + distributed apps
    Impala
  • Alternatives to Hadoop: MongoDB Native MapReduce
    Write MapReduce in JavaScript
    • JavaScript is not fast
    • Has limited data types
    • Hard to use complex analytic libs
    Adds load to data store
  • Alternatives to Hadoop: MongoDB Native MapReduce
    Hadoop has libs for
    • Machine learning
    • ETL
    • Can access any JVM analytic libs
    And many organizations already use Hadoop
  • Alternatives to Hadoop: MongoDB Aggregation Framework
    Great when
    • Doing SQL-style aggregation
    • Do not require external data libs
    • Users will learn framework
  • Alternatives to Hadoop: MongoDB Aggregation Framework
    But you may want Hadoop when
    • Doing sophisticated aggregation
    • Require external data libs
    • Users unwilling to learn framework
    • Need to transfer workload off datastore
  • Pig: On Hadoop
    Less code
    Expressive code
  • Pig: Brief, Expressive
    Like procedural SQL
    (thanks: Twitter Hadoop World presentation)
  • The Same Script, In MapReduce: For Serious
  • Pig: On Hadoop
    Less code
    Expressive code
    Compiles to MR
    Insulates from API
    Popular (LinkedIn, Twitter, Salesforce, Yahoo, Stanford)
  • MongoDB + Pig: Motivations
    Data storage and data processing are often separate concerns
    Hadoop is built for scalable processing of large datasets
  • MongoDB, Pig: Similar Stance
    Poly-structured data
    • MongoDB: stores data, regardless of structure
    • Pig: reads data, regardless of structure (got its name because pigs are omnivorous)
  • MongoDB, Pig: JSON-Pig Data Type Mapping
    JSON       Pig
    string     chararray
    integer    int
    boolean    boolean
    double     double
    array      bag
    object     map/tuple
    null       null
  • MongoDB, Pig: MongoDB-Pig Data Type Mapping
    MongoDB        Pig
    date           datetime
    object id      chararray
    binary data    bytearray
    regexp         chararray
    code           chararray
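
The two mapping slides above can be read as a lookup from a decoded value to a Pig type name. The following is a minimal sketch of that lookup in Python, not the connector's own conversion code; `pig_type_for` and `describe` are hypothetical names, and the sketch always reports `map` for objects even though the slide lists map/tuple.

```python
import datetime


def pig_type_for(value):
    """Return the Pig type name the slides pair with a decoded value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "chararray"
    if isinstance(value, (bytes, bytearray)):
        return "bytearray"
    if isinstance(value, datetime.datetime):
        return "datetime"
    if isinstance(value, list):
        return "bag"
    if isinstance(value, dict):
        return "map"  # assumption: the slide lists map/tuple; this sketch picks map
    raise TypeError("no mapping listed for " + type(value).__name__)


def describe(document):
    """Map each top-level field of a document to its Pig type name."""
    return {field: pig_type_for(value) for field, value in document.items()}


schema = describe({"mailbox": "bass-e", "size": 450, "read": False, "headers": {}})
```

Object ids, regular expressions, and code values are left out because they need a BSON library to represent; per the slide, all three would surface in Pig as `chararray`.
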
  • Mortar: Fast Intro
    Open-source code-based dev framework for data, built on Hadoop and Pig
    Inspired by Rails
    Self-contained, organized, executable projects
  • > gem install mortar
  • > mortar new my_project
  • Mortar: Fast Intro
    Our service hosts and executes Mortar projects
  • > mortar jobs:run your_pigscript --clustersize 5
  • Mortar: Fast Intro
    Browser-only interface, great for demonstrating Hadoop
  • MongoDB, Pig: Loading Data
    One requirement:
    • Must specify top-level fields to load from the MongoDB collection.
    Optional:
    • Specify a subset of embedded fields
    • Data type for any/all fields
  • MongoDB, Pig: Loading Data - Enron Data
    {
        "body": "the ... person...",
        "subFolder": "notes_inbox",
        "mailbox": "bass-e",
        "filename": "450.",
        "headers": {
            "From": "michael.simmons@enron.com",
            "To": "tim_belden@enron.com",
            "Subject": "Subject",
            "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
        }
    }
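
To make the loading rule concrete (name the top-level fields, optionally keep only some embedded fields), here is a small Python sketch applied to the Enron sample document. It illustrates the idea only and is not how the connector is invoked; the `project` helper and its parameters are hypothetical.

```python
import json

# The sample document from the slide, with the elision in "body" kept as-is.
ENRON_DOC = json.loads("""
{
    "body": "the ... person...",
    "subFolder": "notes_inbox",
    "mailbox": "bass-e",
    "filename": "450.",
    "headers": {
        "From": "michael.simmons@enron.com",
        "To": "tim_belden@enron.com",
        "Subject": "Subject",
        "Date": "Mon, 14 May 2001 16:39:00 -0700 (PDT)"
    }
}
""")


def project(document, top_level, embedded=None):
    """Keep the named top-level fields; trim embedded documents when asked.

    top_level: list of field names to load (the one required input).
    embedded:  optional dict mapping a top-level field to the sub-fields to keep.
    """
    embedded = embedded or {}
    row = {}
    for field in top_level:
        value = document.get(field)  # a missing field loads as None
        wanted = embedded.get(field)
        if wanted is not None and isinstance(value, dict):
            value = {key: value.get(key) for key in wanted}
        row[field] = value
    return row


row = project(ENRON_DOC, ["mailbox", "headers"], {"headers": ["From", "To"]})
```

Fields that are not named, such as `body` and `filename` here, never reach the processing step, which keeps the loaded rows small.
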
  • MongoDB, Pig: Script Demo
  • MongoDB, Pig: Store Statement
    The MongoStorage function takes an optional list of arguments of two types:
    • A single set of keys to base updating on. This has three options: None, update, or multi.
    • Multiple indexes to ensure, in the same format as db.col.ensureIndex().
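
One way to read the three update options on the store slide is sketched below against an in-memory list of dicts. This is an interpretation, not the MongoStorage implementation: the `store` function is hypothetical, and inserting the record when nothing matches (an upsert) is an assumption the slide does not state.

```python
def store(collection, record, mode=None, keys=()):
    """Write one record into an in-memory list of dicts.

    mode=None     -> always insert the record
    mode="update" -> update the first document that matches on `keys`
    mode="multi"  -> update every document that matches on `keys`
    Assumption: when nothing matches, the record is inserted (upsert).
    """
    if mode is None:
        collection.append(dict(record))
        return
    if mode not in ("update", "multi"):
        raise ValueError("mode must be None, 'update', or 'multi'")
    if not keys:
        raise ValueError("update modes need at least one key")
    matched = False
    for doc in collection:
        if all(doc.get(key) == record.get(key) for key in keys):
            doc.update(record)
            matched = True
            if mode == "update":
                break  # only the first match is touched
    if not matched:
        collection.append(dict(record))


emails = [
    {"mailbox": "bass-e", "count": 1},
    {"mailbox": "bass-e", "count": 1},
]
store(emails, {"mailbox": "bass-e", "count": 5}, mode="update", keys=["mailbox"])
after_update = [doc["count"] for doc in emails]
store(emails, {"mailbox": "bass-e", "count": 7}, mode="multi", keys=["mailbox"])
after_multi = [doc["count"] for doc in emails]
store(emails, {"mailbox": "belden-t", "count": 2}, mode="update", keys=["mailbox"])
```

The index arguments from the slide are not modeled here, since they affect how MongoDB looks documents up rather than which documents are written.
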
  • Pig: Illustrate
    Auto-select dataset
    Exercise every execution path
    Step-by-step execution
  • Pig: Why Illustrate
    Write correct code quickly
    Understand others' code
    Test every execution path, every step
  • Pig: User-Defined Functions (UDF)
    Pig is like procedural SQL
    UDFs for rich data manipulation
    UDFs: Java-based language
    We made Pig work with CPython (NumPy, etc.)
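
A Python UDF for Pig is typically an ordinary function with a declared output schema. The sketch below assumes the `outputSchema` decorator from Pig's `pig_util` helper module is importable when the script runs under Pig, and falls back to a stand-in so the function can be exercised in a plain interpreter. The function name `email_domain` is hypothetical, and how the script is registered in a Pig job is not shown.

```python
try:
    # Assumption: available when Pig runs the script; absent in a plain interpreter.
    from pig_util import outputSchema
except ImportError:
    def outputSchema(schema):
        """Stand-in decorator so the function can be tested outside Pig."""
        def wrap(func):
            func.output_schema = schema
            return func
        return wrap


@outputSchema("domain:chararray")
def email_domain(address):
    """Return the lower-cased domain of an email address, or None if malformed."""
    if address is None:
        return None
    _local, separator, domain = address.strip().rpartition("@")
    return domain.lower() if separator and domain else None
```

Applied to the `From` header of the Enron sample, `email_domain("michael.simmons@enron.com")` yields `enron.com`, which could then serve as a grouping key.
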
  • MongoDB + Pig: Without Mortar
    Get the mongo-hadoop connector: http://github.com/mongodb/mongo-hadoop
  • MongoDB + Pig: Summary
    Hadoop and friends are maturing
    MongoDB and Pig are philosophically aligned
    Reading and writing to Pig is straightforward
    Once in Pig (Hadoop)
    • massive batch calcs / analytics possible
    • work is offloaded
    • external libraries available