Introduction to MongoDB and Hadoop

An introduction to MongoDB, followed by an introduction to using MongoDB with Hadoop.

This presentation was given at the CT Java Users Group in March 2013.

Published in: Technology

  1. 1. #MongoDB Introduction to MongoDB & MongoDB + Hadoop. Steve Francia, Chief Evangelist, 10gen
  2. 2. What is MongoDB
  3. 3. MongoDB is a ___________ database • Document • Open source • High performance • Horizontally scalable • Full featured
  4. 4. Document Database • Not for .PDF & .DOC files • A document is essentially an associative array • Document == JSON object • Document == PHP Array • Document == Python Dict • Document == Ruby Hash • etc.
  5. 5. Open Source • MongoDB is an open source project • On GitHub • Licensed under the AGPL • Started & sponsored by 10gen • Commercial licenses available • Contributions welcome
  6. 6. High Performance • Written in C++ • Extensive use of memory-mapped files, i.e. read-through, write-through memory caching • Runs nearly everywhere • Data serialized as BSON (fast parsing) • Full support for primary & secondary indexes • Document model = less work
  7. 7. Horizontally Scalable
  8. 8. Full Featured • Ad hoc queries • Real-time aggregation • Rich query capabilities • Traditionally consistent • Geospatial features • Support for most programming languages • Flexible schema
  9. 9. Database Landscape
  10. 10. http://www.mongodb.org/downloads
  11. 11. Mongo Shell
  12. 12. Document Database
  13. 13. Terminology (RDBMS ➜ MongoDB): Table, View ➜ Collection; Row ➜ Document; Index ➜ Index; Join ➜ Embedded Document; Foreign Key ➜ Reference; Partition ➜ Shard
  14. 14. Typical (relational) ERD
  15. 15. MongoDB ERD
  16. 16. Working with MongoDB
  17. 17. Creating an author
      > db.author.insert({
          first_name: "j.r.r.",
          last_name: "tolkien",
          bio: "J.R.R. Tolkien (1892-1973), beloved throughout the world as the creator of The Hobbit and The Lord of the Rings, was a professor of Anglo-Saxon at Oxford, a fellow of Pembroke College, and a fellow of Merton College until his retirement in 1959. His chief interest was the linguistic aspects of the early English written tradition, but even as he studied these classics he was creating a set of his own."
        })
  18. 18. Querying for our author
      > db.author.findOne({ last_name: "tolkien" })
      {
        "_id" : ObjectId("507ffbb1d94ccab2da652597"),
        "first_name" : "j.r.r.",
        "last_name" : "tolkien",
        "bio" : "J.R.R. Tolkien (1892-1973), beloved throughout the world as the creator of The Hobbit and The Lord of the Rings, was a professor of Anglo-Saxon at Oxford, a fellow of Pembroke College, and a fellow of Merton College until his retirement in 1959. His chief interest was the linguistic aspects of the early English written tradition, but even as he studied these classics he was creating a set of his own."
      }
  19. 19. Creating a Book
      > db.books.insert({
          title: "fellowship of the ring, the",
          author: ObjectId("507ffbb1d94ccab2da652597"),
          language: "english",
          genre: ["fantasy", "adventure"],
          publication: {
            name: "george allen & unwin",
            location: "London",
            date: new Date("21 July 1954")
          }
        })
      http://society6.com/PastaSoup/The-Fellowship-of-the-Ring-ZZc_Print/
  20. 20. Multiple values per key
      > db.books.findOne({ language: "english" }, { genre: 1 })
      {
        "_id" : ObjectId("50804391d94ccab2da652598"),
        "genre" : [ "fantasy", "adventure" ]
      }
  21. 21. Querying for key with multiple values
      > db.books.findOne({ genre: "fantasy" }, { title: 1 })
      {
        "_id" : ObjectId("50804391d94ccab2da652598"),
        "title" : "fellowship of the ring, the"
      }
      Query a key with a single value or with multiple values the same way.
  22. 22. Nested Values
      > db.books.findOne({}, { publication: 1 })
      {
        "_id" : ObjectId("50804ec7d94ccab2da65259a"),
        "publication" : {
          "name" : "george allen & unwin",
          "location" : "London",
          "date" : ISODate("1954-07-21T04:00:00Z")
        }
      }
  23. 23. Reach into nested values using dot notation
      > db.books.findOne({ "publication.date": { $lt: new Date("21 June 1960") } })
      {
        "_id" : ObjectId("50804391d94ccab2da652598"),
        "title" : "fellowship of the ring, the",
        "author" : ObjectId("507ffbb1d94ccab2da652597"),
        "language" : "english",
        "genre" : [ "fantasy", "adventure" ],
        "publication" : {
          "name" : "george allen & unwin",
          "location" : "London",
          "date" : ISODate("1954-07-21T04:00:00Z")
        }
      }
  24. 24. Update books
      > db.books.update(
          { "_id" : ObjectId("50804391d94ccab2da652598") },
          { $set : { isbn: "0547928211", pages: 432 } }
        )
      True agile development: simply change how you work with the data and the database follows.
  25. 25. The Updated Book record
      > db.books.findOne()
      {
        "_id" : ObjectId("50804ec7d94ccab2da65259a"),
        "author" : ObjectId("507ffbb1d94ccab2da652597"),
        "genre" : [ "fantasy", "adventure" ],
        "isbn" : "0395082544",
        "language" : "english",
        "pages" : 432,
        "publication" : {
          "name" : "george allen & unwin",
          "location" : "London",
          "date" : ISODate("1954-07-21T04:00:00Z")
        },
        "title" : "fellowship of the ring, the"
      }
  26. 26. Creating indexes
      > db.books.ensureIndex({ title: 1 })
      > db.books.ensureIndex({ genre: 1 })
      > db.books.ensureIndex({ "publication.date": -1 })
  27. 27. Finding author by book
      > book = db.books.findOne({ "title" : "return of the king, the" })
      > db.author.findOne({ _id: book.author })
      {
        "_id" : ObjectId("507ffbb1d94ccab2da652597"),
        "first_name" : "j.r.r.",
        "last_name" : "tolkien",
        "bio" : "J.R.R. Tolkien (1892-1973), beloved throughout the world as the creator of The Hobbit and The Lord of the Rings, was a professor of Anglo-Saxon at Oxford, a fellow of Pembroke College, and a fellow of Merton College until his retirement in 1959. His chief interest was the linguistic aspects of the early English written tradition, but even as he studied these classics he was creating a set of his own."
      }
  28. 28. The Big Data Story
  29. 29. Is actually two stories
  30. 30. Doers & Tellers talking about different things http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september
  31. 31. Tellers
  32. 32. Doers
  33. 33. Doers talk a lot more about actual solutions
  34. 34. They know it's a two-sided story: Storage and Processing
  35. 35. Takeaways • MongoDB and Hadoop • MongoDB for storage & operations • Hadoop for processing & analytics
  36. 36. MongoDB & Data Processing
  37. 37. Applications have complex needs • MongoDB ideal operational database • MongoDB ideal for BIG data • Not a data processing engine, but provides processing functionality
  38. 38. Many options for Processing Data • Process in MongoDB using Map Reduce • Process in MongoDB using Aggregation Framework • Process outside MongoDB (using Hadoop)
  39. 39. MongoDB MapReduce
  40. 40. MongoDB Map Reduce • MongoDB map reduce quite capable... but with limits • - JavaScript not best language for processing map reduce • - JavaScript limited in external data processing libraries • - Adds load to data store
  41. 41. MongoDB Aggregation • Most uses of MongoDB Map Reduce were for aggregation • Aggregation Framework optimized for aggregate queries • Real-time aggregation similar to SQL GROUP BY
  42. 42. MongoDB & Hadoop
  43. 43. DEMO • Install Hadoop MongoDB Plugin • Import tweets from Twitter • Write mapper • Write reducer • Call myself a data scientist
  44. 44. Installing mongo-hadoop https://gist.github.com/1887726
      hadoop_version="0.23"
      hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
      git clone git://github.com/mongodb/mongo-hadoop.git
      cd mongo-hadoop
      sed -i "s/default/$hadoop_version/g" build.sbt
      cd streaming
      ./build.sh
  45. 45. Grokking Twitter
      curl https://stream.twitter.com/1/statuses/sample.json -u <login>:<password> | mongoimport -d test -c live
      ... let it run for about 2 hours
  46. 46. DEMO 1
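  47. 47. Map Hashtags in Java
      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.bson.BSONObject;
      import org.bson.types.BasicBSONList;

      public class TwitterMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
          @Override
          public void map( final Object pKey, final BSONObject pValue, final Context pContext )
                  throws IOException, InterruptedException {
              // Skip tweets that carry no entities or no hashtags.
              BSONObject entities = (BSONObject) pValue.get("entities");
              if (entities == null) return;
              BasicBSONList hashtags = (BasicBSONList) entities.get("hashtags");
              if (hashtags == null) return;
              // Emit (tag, 1) for every hashtag in the tweet.
              for (Object o : hashtags) {
                  String tag = (String) ((BSONObject) o).get("text");
                  pContext.write( new Text( tag ), new IntWritable( 1 ) );
              }
          }
      }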
  48. 48. Reduce hashtags in Java
      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public class TwitterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          public void reduce( final Text pKey, final Iterable<IntWritable> pValues, final Context pContext )
                  throws IOException, InterruptedException {
              // Sum the counts emitted for this hashtag.
              int count = 0;
              for ( final IntWritable value : pValues ) {
                  count += value.get();
              }
              pContext.write( pKey, new IntWritable( count ) );
          }
      }
  49. 49. All together
      #!/bin/bash
      # bash rather than sh: the script relies on arrays (declare -a).
      export HADOOP_HOME="/Users/mike/hadoop/hadoop-1.0.4"
      declare -a job_args
      cd ..
      job_args=("jar" "examples/twitter/target/twitter-example_*.jar")
      job_args=(${job_args[@]} "com.mongodb.hadoop.examples.twitter.TwitterConfig")
      job_args=(${job_args[@]} "-D" "mongo.job.verbose=true")
      job_args=(${job_args[@]} "-D" "mongo.job.background=false")
      job_args=(${job_args[@]} "-D" "mongo.input.key=")
      job_args=(${job_args[@]} "-D" "mongo.input.uri=mongodb://localhost:27017/test.live")
      job_args=(${job_args[@]} "-D" "mongo.output.uri=mongodb://localhost:27017/test.twit_hashtags")
      job_args=(${job_args[@]} "-D" "mongo.input.query=")
      job_args=(${job_args[@]} "-D" "mongo.job.mapper=com.mongodb.hadoop.examples.twitter.TwitterMapper")
      job_args=(${job_args[@]} "-D" "mongo.job.reducer=com.mongodb.hadoop.examples.twitter.TwitterReducer")
      job_args=(${job_args[@]} "-D" "mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat")
      job_args=(${job_args[@]} "-D" "mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat")
      job_args=(${job_args[@]} "-D" "mongo.job.output.key=org.apache.hadoop.io.Text")
      job_args=(${job_args[@]} "-D" "mongo.job.output.value=org.apache.hadoop.io.IntWritable")
      job_args=(${job_args[@]} "-D" "mongo.job.mapper.output.key=org.apache.hadoop.io.Text")
      job_args=(${job_args[@]} "-D" "mongo.job.mapper.output.value=org.apache.hadoop.io.IntWritable")
      job_args=(${job_args[@]} "-D" "mongo.job.combiner=com.mongodb.hadoop.examples.twitter.TwitterReducer")
      job_args=(${job_args[@]} "-D" "mongo.job.partitioner=")
      job_args=(${job_args[@]} "-D" "mongo.job.sort_comparator=")
      #echo "${job_args[@]}"
      $HADOOP_HOME/bin/hadoop "${job_args[@]}" "$1"
  50. 50. Popular HashTags
      db.twit_hashtags.find().sort( { count : -1 } )
      { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
      { "_id" : "teamfollowback", "count" : 200 }
      { "_id" : "RT", "count" : 150 }
      { "_id" : "Arsenal", "count" : 148 }
      { "_id" : "milars", "count" : 145 }
      { "_id" : "sanremo", "count" : 145 }
      { "_id" : "LoseMyNumberIf", "count" : 139 }
      { "_id" : "RelationshipsShould", "count" : 137 }
      { "_id" : "oomf", "count" : 117 }
      { "_id" : "TeamFollowBack", "count" : 105 }
      { "_id" : "WhyDoPeopleThink", "count" : 102 }
      { "_id" : "np", "count" : 100 }
  51. 51. DEMO 2
  52. 52. Aggregation in Mongo 2.2
      db.live.aggregate(
        { $unwind : "$entities.hashtags" },
        { $match : { "entities.hashtags.text" : { $exists : true } } },
        { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
        { $sort : { count : -1 } },
        { $limit : 10 }
      )
  53. 53. Popular HashTags
      db.twit_hashtags.aggregate(a)
      {
        "result" : [
          { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
          { "_id" : "teamfollowback", "count" : 200 },
          { "_id" : "RT", "count" : 150 },
          { "_id" : "Arsenal", "count" : 148 },
          { "_id" : "milars", "count" : 145 },
          { "_id" : "sanremo", "count" : 145 },
          { "_id" : "LoseMyNumberIf", "count" : 139 },
          { "_id" : "RelationshipsShould", "count" : 137 }
        ],
        "ok" : 1
      }
  54. 54. #MongoDB Questions? Steve Francia, Chief Evangelist, 10gen. @spf13, Spf13.com
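
The hashtag count walked through in slides 47 to 52 (emit a 1 for each hashtag, sum per tag, sort by count, keep the top few) can be tried without MongoDB or Hadoop. The following is a minimal sketch in plain JavaScript, the language of the mongo shell. The sample tweets and the function names `mapHashtags`, `reduceCounts`, and `topHashtags` are assumptions made for illustration and do not come from the deck or from any MongoDB API.

```javascript
// Hypothetical sample documents shaped like the tweets in the deck:
// a tweet may carry entities.hashtags, a list of { text: ... } objects.
const tweets = [
  { entities: { hashtags: [{ text: "Arsenal" }, { text: "RT" }] } },
  { entities: { hashtags: [{ text: "RT" }] } },
  { entities: { hashtags: [] } },
  { text: "a tweet with no entities field" },
  { entities: { hashtags: [{ text: "RT" }, { text: "np" }] } },
];

// Map step: emit one [tag, 1] pair per hashtag, skipping tweets that
// lack entities or hashtags (the same checks as the Java mapper).
function mapHashtags(tweet) {
  const entities = tweet.entities;
  if (!entities || !entities.hashtags) return [];
  return entities.hashtags.map((h) => [h.text, 1]);
}

// Reduce step: sum the emitted values for each tag.
function reduceCounts(pairs) {
  const counts = new Map();
  for (const [tag, n] of pairs) {
    counts.set(tag, (counts.get(tag) || 0) + n);
  }
  return counts;
}

// Sort by count, descending, and keep the first `limit` entries,
// loosely mirroring the $sort and $limit stages of the pipeline.
function topHashtags(docs, limit) {
  const pairs = docs.flatMap(mapHashtags);
  return [...reduceCounts(pairs)]
    .map(([tag, count]) => ({ _id: tag, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

console.log(topHashtags(tweets, 2));
```

With the sample data above, `RT` appears three times and ranks first. The map and reduce functions are kept separate so the correspondence with the Java mapper and reducer stays visible, while `topHashtags` plays the role of the final sort and limit.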
