Introduction to MongoDB and Hadoop

An introduction to MongoDB, followed by an introduction to using MongoDB with Hadoop.

This presentation was given at the CT Java Users Group in March 2013.

  • AGPL – GNU Affero General Public License
  • * Big endian and ARM not supported.
  • Powerful message here. Finally a database that enables rapid & agile development.
  • Creating a book here. A few things to make note of.

Introduction to MongoDB and Hadoop: Presentation Transcript

  • #MongoDB: Introduction to MongoDB & MongoDB + Hadoop. Steve Francia, Chief Evangelist, 10gen
  • What is MongoDB
  • MongoDB is a ___________ database
    • Document
    • Open source
    • High performance
    • Horizontally scalable
    • Full featured
  • Document Database
    • Not for .PDF & .DOC files
    • A document is essentially an associative array
    • Document == JSON object
    • Document == PHP Array
    • Document == Python Dict
    • Document == Ruby Hash
    • etc.
  • Open Source
    • MongoDB is an open source project
    • On GitHub
    • Licensed under the AGPL
    • Started & sponsored by 10gen
    • Commercial licenses available
    • Contributions welcome
  • High Performance
    • Written in C++
    • Extensive use of memory-mapped files, i.e. read-through, write-through memory caching
    • Runs nearly everywhere
    • Data serialized as BSON (fast parsing)
    • Full support for primary & secondary indexes
    • Document model = less work
  • Horizontally Scalable
  • Full Featured
    • Ad hoc queries
    • Real-time aggregation
    • Rich query capabilities
    • Traditionally consistent
    • Geospatial features
    • Support for most programming languages
    • Flexible schema
  • Database Landscape
  • http://www.mongodb.org/downloads
  • Mongo Shell
  • Document Database
  • Terminology
    RDBMS ➜ MongoDB
    Table, View ➜ Collection
    Row ➜ Document
    Index ➜ Index
    Join ➜ Embedded Document
    Foreign Key ➜ Reference
    Partition ➜ Shard
  • Typical (relational) ERD
  • MongoDB ERD
  • Working with MongoDB
  • Creating an author
    > db.author.insert({
        first_name: "j.r.r.",
        last_name: "tolkien",
        bio: "J.R.R. Tolkien (1892-1973), beloved throughout the world as the creator of The Hobbit and The Lord of the Rings, was a professor of Anglo-Saxon at Oxford, a fellow of Pembroke College, and a fellow of Merton College until his retirement in 1959. His chief interest was the linguistic aspects of the early English written tradition, but even as he studied these classics he was creating a set of his own."
      })
  • Querying for our author
    > db.author.findOne({ last_name: "tolkien" })
    {
      "_id" : ObjectId("507ffbb1d94ccab2da652597"),
      "first_name" : "j.r.r.",
      "last_name" : "tolkien",
      "bio" : "J.R.R. Tolkien (1892-1973), beloved throughout the world as the creator of The Hobbit and The Lord of the Rings, was a professor of Anglo-Saxon at Oxford, a fellow of Pembroke College, and a fellow of Merton College until his retirement in 1959. His chief interest was the linguistic aspects of the early English written tradition, but even as he studied these classics he was creating a set of his own."
    }
  • Creating a Book
    > db.books.insert({
        title: "fellowship of the ring, the",
        author: ObjectId("507ffbb1d94ccab2da652597"),
        language: "english",
        genre: ["fantasy", "adventure"],
        publication: {
          name: "george allen & unwin",
          location: "London",
          date: new Date("21 July 1954")
        }
      })
    http://society6.com/PastaSoup/The-Fellowship-of-the-Ring-ZZc_Print/
  • Multiple values per key
    > db.books.findOne({ language: "english" }, { genre: 1 })
    {
      "_id" : ObjectId("50804391d94ccab2da652598"),
      "genre" : [ "fantasy", "adventure" ]
    }
  • Querying for key with multiple values
    > db.books.findOne({ genre: "fantasy" }, { title: 1 })
    {
      "_id" : ObjectId("50804391d94ccab2da652598"),
      "title" : "fellowship of the ring, the"
    }
    Query a key with a single value or with multiple values the same way.
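The "same way" rule on this slide can be pictured with a small plain-JavaScript sketch. It runs without a database, and `matchesEquality` is a hypothetical helper, not a MongoDB API. It only imitates the basic equality rule for a top-level field and ignores operators, nested arrays and whole-array equality.

```javascript
// Hypothetical helper (not a MongoDB API): imitates how an equality
// condition treats a field holding one value versus an array of values.
function matchesEquality(doc, key, wanted) {
  const value = doc[key];
  // An array field matches when any of its elements equals the wanted value.
  if (Array.isArray(value)) {
    return value.includes(wanted);
  }
  // A scalar field matches on plain equality.
  return value === wanted;
}

const book = {
  title: "fellowship of the ring, the",
  language: "english",
  genre: ["fantasy", "adventure"],
};

console.log(matchesEquality(book, "genre", "fantasy"));    // array-valued key
console.log(matchesEquality(book, "language", "english")); // single-valued key
console.log(matchesEquality(book, "genre", "romance"));    // value not present
```

The caller writes the same condition in both cases; only the stored value differs.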
  • Nested Values
    > db.books.findOne({}, { publication: 1 })
    {
      "_id" : ObjectId("50804ec7d94ccab2da65259a"),
      "publication" : {
        "name" : "george allen & unwin",
        "location" : "London",
        "date" : ISODate("1954-07-21T04:00:00Z")
      }
    }
  • Reach into nested values using dot notation
    > db.books.findOne({ "publication.date": { $lt: new Date("21 June 1960") } })
    {
      "_id" : ObjectId("50804391d94ccab2da652598"),
      "title" : "fellowship of the ring, the",
      "author" : ObjectId("507ffbb1d94ccab2da652597"),
      "language" : "english",
      "genre" : [ "fantasy", "adventure" ],
      "publication" : {
        "name" : "george allen & unwin",
        "location" : "London",
        "date" : ISODate("1954-07-21T04:00:00Z")
      }
    }
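As a rough illustration of what dot notation means, the plain-JavaScript sketch below resolves a dotted path against a nested object and then applies a less-than date comparison. `getPath` is a hypothetical helper written for this example, not something MongoDB exposes, and it does not handle arrays along the path.

```javascript
// Hypothetical helper: walks a dotted path such as "publication.date"
// through nested objects, returning undefined if any step is missing.
function getPath(doc, path) {
  return path.split(".").reduce(
    (current, key) => (current == null ? undefined : current[key]),
    doc
  );
}

const book = {
  title: "fellowship of the ring, the",
  publication: {
    name: "george allen & unwin",
    location: "London",
    date: new Date("1954-07-21T04:00:00Z"),
  },
};

// Rough equivalent of { "publication.date": { $lt: <cutoff> } }.
const cutoff = new Date("1960-06-21T00:00:00Z");
const published = getPath(book, "publication.date");
console.log(published < cutoff);
```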
  • Update books
    > db.books.update(
        { "_id" : ObjectId("50804391d94ccab2da652598") },
        { $set : { isbn: "0547928211", pages: 432 } }
      )
    True agile development. Simply change how you work with the data and the database follows.
  • The Updated Book record
    > db.books.findOne()
    {
      "_id" : ObjectId("50804ec7d94ccab2da65259a"),
      "author" : ObjectId("507ffbb1d94ccab2da652597"),
      "genre" : [ "fantasy", "adventure" ],
      "isbn" : "0395082544",
      "language" : "english",
      "pages" : 432,
      "publication" : {
        "name" : "george allen & unwin",
        "location" : "London",
        "date" : ISODate("1954-07-21T04:00:00Z")
      },
      "title" : "fellowship of the ring, the"
    }
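The effect of `$set` on a document that had no `isbn` or `pages` field can be sketched in plain JavaScript. `applySet` is a hypothetical stand-in that only covers top-level field names (no dotted paths), and the sample values are illustrative.

```javascript
// Hypothetical stand-in for a $set update on top-level fields:
// named fields are added or overwritten, everything else is kept.
function applySet(doc, fields) {
  return { ...doc, ...fields };
}

const before = {
  title: "fellowship of the ring, the",
  language: "english",
};

const after = applySet(before, { isbn: "0547928211", pages: 432 });

console.log(Object.keys(before)); // original document is left untouched
console.log(Object.keys(after));  // new fields appear with no schema change
```

No migration step is involved: the new fields exist only on the documents that were updated.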
  • Creating indexes
    > db.books.ensureIndex({ title: 1 })
    > db.books.ensureIndex({ genre: 1 })
    > db.books.ensureIndex({ "publication.date": -1 })
  • Finding author by book
    > book = db.books.findOne({ "title" : "return of the king, the" })
    > db.author.findOne({ _id: book.author })
    {
      "_id" : ObjectId("507ffbb1d94ccab2da652597"),
      "first_name" : "j.r.r.",
      "last_name" : "tolkien",
      "bio" : "J.R.R. Tolkien (1892-1973), beloved throughout the world as the creator of The Hobbit and The Lord of the Rings, was a professor of Anglo-Saxon at Oxford, a fellow of Pembroke College, and a fellow of Merton College until his retirement in 1959. His chief interest was the linguistic aspects of the early English written tradition, but even as he studied these classics he was creating a set of his own."
    }
  • The Big Data Story
  • Is actually two stories
  • Doers & Tellers talking about different things
    http://www.slideshare.net/siliconangle/trendconnect-big-data-report-september
  • Tellers
  • Doers
  • Doers talk a lot more about actual solutions
  • They know it's a two-sided story: Storage and Processing
  • Takeaways
    • MongoDB and Hadoop
    • MongoDB for storage & operations
    • Hadoop for processing & analytics
  • MongoDB & Data Processing
  • Applications have complex needs
    • MongoDB ideal operational database
    • MongoDB ideal for BIG data
    • Not a data processing engine, but provides processing functionality
  • Many options for Processing Data
    • Process in MongoDB using Map Reduce
    • Process in MongoDB using Aggregation Framework
    • Process outside MongoDB (using Hadoop)
  • MongoDB MapReduce
  • MongoDB Map Reduce
    • MongoDB map reduce quite capable... but with limits
      - JavaScript not best language for processing map reduce
      - JavaScript limited in external data processing libraries
      - Adds load to data store
  • MongoDB Aggregation
    • Most uses of MongoDB Map Reduce were for aggregation
    • Aggregation Framework optimized for aggregate queries
    • Real-time aggregation similar to SQL GROUP BY
  • MongoDB & Hadoop
  • DEMO
    • Install Hadoop MongoDB Plugin
    • Import tweets from Twitter
    • Write mapper
    • Write reducer
    • Call myself a data scientist
  • Installing mongo-hadoop
    https://gist.github.com/1887726
    hadoop_version=0.23
    hadoop_path="/usr/local/Cellar/hadoop/$hadoop_version.0/libexec/lib"
    git clone git://github.com/mongodb/mongo-hadoop.git
    cd mongo-hadoop
    sed -i "s/default/$hadoop_version/g" build.sbt
    cd streaming
    ./build.sh
  • Grokking Twitter
    curl https://stream.twitter.com/1/statuses/sample.json -u <login>:<password> | mongoimport -d test -c live
    ... let it run for about 2 hours
  • DEMO 1
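  • Map Hashtags in Java
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.bson.BSONObject;
    import org.bson.types.BasicBSONList;

    public class TwitterMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
      @Override
      public void map(final Object pKey, final BSONObject pValue, final Context pContext)
          throws IOException, InterruptedException {
        // Skip tweets that carry no "entities" sub-document.
        BSONObject entities = (BSONObject) pValue.get("entities");
        if (entities == null) return;
        // Skip tweets whose entities contain no hashtag list.
        BasicBSONList hashtags = (BasicBSONList) entities.get("hashtags");
        if (hashtags == null) return;
        // Emit (hashtag text, 1) for every hashtag in the tweet.
        for (Object o : hashtags) {
          String tag = (String) ((BSONObject) o).get("text");
          pContext.write(new Text(tag), new IntWritable(1));
        }
      }
    }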
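  • Reduce hashtags in Java
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class TwitterReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      @Override
      public void reduce(final Text pKey, final Iterable<IntWritable> pValues, final Context pContext)
          throws IOException, InterruptedException {
        // Sum the counts emitted by the mappers for this hashtag.
        int count = 0;
        for (final IntWritable value : pValues) {
          count += value.get();
        }
        pContext.write(pKey, new IntWritable(count));
      }
    }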
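To see how the mapper and reducer above fit together, here is a minimal in-memory sketch of the same map, group and reduce flow in plain JavaScript. It does not use Hadoop or the mongo-hadoop connector; `mapTweet`, `reduceCounts`, `runJob` and the sample tweets are hypothetical names and data made up for illustration.

```javascript
// Mapper: emit a [tag, 1] pair for every hashtag in one tweet document.
function mapTweet(tweet) {
  const hashtags = tweet.entities && tweet.entities.hashtags;
  if (!Array.isArray(hashtags)) return [];
  return hashtags.map((h) => [h.text, 1]);
}

// Reducer: sum all the counts collected for one hashtag.
function reduceCounts(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

// Driver: stands in for the shuffle step by grouping mapper output by key,
// then calls the reducer once per key.
function runJob(tweets) {
  const grouped = new Map();
  for (const tweet of tweets) {
    for (const [tag, one] of mapTweet(tweet)) {
      if (!grouped.has(tag)) grouped.set(tag, []);
      grouped.get(tag).push(one);
    }
  }
  const result = {};
  for (const [tag, values] of grouped) {
    result[tag] = reduceCounts(tag, values);
  }
  return result;
}

// Made-up sample input shaped like the tweets the mapper expects.
const sampleTweets = [
  { entities: { hashtags: [{ text: "Arsenal" }, { text: "np" }] } },
  { entities: { hashtags: [{ text: "Arsenal" }] } },
  { entities: { hashtags: [] } },
  { text: "no entities here" },
];

console.log(runJob(sampleTweets));
```

In the real job Hadoop performs the grouping across machines; the mapper and reducer bodies stay this small.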
  • All together
    #!/bin/sh
    export HADOOP_HOME="/Users/mike/hadoop/hadoop-1.0.4"
    declare -a job_args
    cd ..
    job_args=("jar" "examples/twitter/target/twitter-example_*.jar")
    job_args=(${job_args[@]} "com.mongodb.hadoop.examples.twitter.TwitterConfig")
    job_args=(${job_args[@]} "-D" "mongo.job.verbose=true")
    job_args=(${job_args[@]} "-D" "mongo.job.background=false")
    job_args=(${job_args[@]} "-D" "mongo.input.key=")
    job_args=(${job_args[@]} "-D" "mongo.input.uri=mongodb://localhost:27017/test.live")
    job_args=(${job_args[@]} "-D" "mongo.output.uri=mongodb://localhost:27017/test.twit_hashtags")
    job_args=(${job_args[@]} "-D" "mongo.input.query=")
    job_args=(${job_args[@]} "-D" "mongo.job.mapper=com.mongodb.hadoop.examples.twitter.TwitterMapper")
    job_args=(${job_args[@]} "-D" "mongo.job.reducer=com.mongodb.hadoop.examples.twitter.TwitterReducer")
    job_args=(${job_args[@]} "-D" "mongo.job.input.format=com.mongodb.hadoop.MongoInputFormat")
    job_args=(${job_args[@]} "-D" "mongo.job.output.format=com.mongodb.hadoop.MongoOutputFormat")
    job_args=(${job_args[@]} "-D" "mongo.job.output.key=org.apache.hadoop.io.Text")
    job_args=(${job_args[@]} "-D" "mongo.job.output.value=org.apache.hadoop.io.IntWritable")
    job_args=(${job_args[@]} "-D" "mongo.job.mapper.output.key=org.apache.hadoop.io.Text")
    job_args=(${job_args[@]} "-D" "mongo.job.mapper.output.value=org.apache.hadoop.io.IntWritable")
    job_args=(${job_args[@]} "-D" "mongo.job.combiner=com.mongodb.hadoop.examples.twitter.TwitterReducer")
    job_args=(${job_args[@]} "-D" "mongo.job.partitioner=")
    job_args=(${job_args[@]} "-D" "mongo.job.sort_comparator=")
    #echo "${job_args[@]}"
    $HADOOP_HOME/bin/hadoop "${job_args[@]}" "$1"
  • Popular HashTags
    > db.twit_hashtags.find().sort({ count : -1 })
    { "_id" : "YouKnowYoureInLoveIf", "count" : 287 }
    { "_id" : "teamfollowback", "count" : 200 }
    { "_id" : "RT", "count" : 150 }
    { "_id" : "Arsenal", "count" : 148 }
    { "_id" : "milars", "count" : 145 }
    { "_id" : "sanremo", "count" : 145 }
    { "_id" : "LoseMyNumberIf", "count" : 139 }
    { "_id" : "RelationshipsShould", "count" : 137 }
    { "_id" : "oomf", "count" : 117 }
    { "_id" : "TeamFollowBack", "count" : 105 }
    { "_id" : "WhyDoPeopleThink", "count" : 102 }
    { "_id" : "np", "count" : 100 }
  • DEMO 2
  • Aggregation in Mongo 2.2
    db.live.aggregate(
      { $unwind : "$entities.hashtags" },
      { $match : { "entities.hashtags.text" : { $exists : true } } },
      { $group : { _id : "$entities.hashtags.text", count : { $sum : 1 } } },
      { $sort : { count : -1 } },
      { $limit : 10 }
    )
  • Popular HashTags
    > db.twit_hashtags.aggregate(a)
    {
      "result" : [
        { "_id" : "YouKnowYoureInLoveIf", "count" : 287 },
        { "_id" : "teamfollowback", "count" : 200 },
        { "_id" : "RT", "count" : 150 },
        { "_id" : "Arsenal", "count" : 148 },
        { "_id" : "milars", "count" : 145 },
        { "_id" : "sanremo", "count" : 145 },
        { "_id" : "LoseMyNumberIf", "count" : 139 },
        { "_id" : "RelationshipsShould", "count" : 137 },
      ],
      "ok" : 1
    }
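The aggregation stages from Demo 2 can be traced step by step with a plain-JavaScript sketch over an in-memory array. This is only an approximation of what each stage does, not how MongoDB executes the pipeline; `topHashtags` and the sample tweets are hypothetical.

```javascript
// Approximates the pipeline stages on an in-memory array of tweets.
function topHashtags(tweets, limit) {
  // $unwind: one record per hashtag; tweets without hashtags drop out.
  const unwound = tweets.flatMap((t) =>
    t.entities && Array.isArray(t.entities.hashtags) ? t.entities.hashtags : []
  );

  // $match with $exists: keep only hashtag entries that have a "text" field.
  const matched = unwound.filter(
    (h) => h !== null && typeof h === "object" && "text" in h
  );

  // $group with $sum: 1, counting occurrences per hashtag text.
  const counts = new Map();
  for (const h of matched) {
    counts.set(h.text, (counts.get(h.text) || 0) + 1);
  }

  // $sort by count descending, then $limit.
  return [...counts]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count)
    .slice(0, limit);
}

// Made-up sample input; real tweets carry many more fields.
const sampleTweets = [
  { entities: { hashtags: [{ text: "a" }, { text: "b" }] } },
  { entities: { hashtags: [{ text: "a" }, { text: "c" }] } },
  { entities: { hashtags: [{ text: "a" }, { text: "b" }, { indices: [0, 1] }] } },
  { entities: {} },
];

console.log(topHashtags(sampleTweets, 2));
```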
  • #MongoDB: Questions? Steve Francia, Chief Evangelist, 10gen. @spf13, Spf13.com