MongoDB Basics

MongoDB workshop given by me at MIT, Pune. This PDF has an example of how to design a MongoDB schema according to application usage.

MongoDB Basics

  1. 1. Sarang Shravagi Python Developer, ScaleArc @_sarangs
  2. 2. Let's Know Each Other • Why are you attending? • Do you code? • OS? • Programming Language? • JSON? • MongoDB?
  3. 3. Agenda • SQL and NoSQL Database • What is MongoDB? • Hands-On and Assignment • Design Models • MongoDB Language Driver • Disaster Recovery • Handling BigData
  4. 4. Data Patterns & Storage Needs • Product Information • User Information • Purchase Information • Product Reviews • Site Interactions • Social Graph • Search Index
  5. 5. SQL to NoSQL Design Paradigm Shift
  6. 6. Database Evolution
  7. 7. SQL Storage • Was designed when – Storage and data transfer were costly – Processing was slow – Applications were oriented more towards data collection • Initial adopters were financial institutions
  8. 8. SQL Storage • Structured – schema • Relational – foreign keys, constraints • Transactional – Atomicity, Consistency, Isolation, Durability • High Availability through robustness – Minimize failures • Optimized for Writes • Typically Scale Up
  9. 9. NoSQL Storage • Designed for a time when – Storage is cheap – Data transfer is fast – Much more processing power is available (clustering of machines is also possible) – Applications are oriented towards consumption of user-generated content – Better on-screen user experience is in demand
  10. 10. NoSQL Storage • Semi-structured – Schemaless • Consistency, Availability, Partition Tolerance • High Availability through clustering – expect failures • Optimized for Reads • Typically Scale Out
  11. 11. Different Databases Half Level Deep
  12. 12. SQL: RDBMS • MySQL, PostgreSQL, Oracle etc. • Stores data in tables having columns – Basic (number, text) data types • Strong query language • Transparent values – Query language can read and filter on them – Relationship between tables based on values • Suited for user info and transactions
  13. 13. NoSQL Data Model
  14. 14. NoSQL: Key/Value
  15. 15. NoSQL: Document • MongoDB, CouchDB etc. • Object-oriented data models – Stores data in document objects having fields – Basic and compound (list, dict) data types • SQL-like queries • Transparent values – Can be part of query • Suited for product info and its reviews
  16. 16. NoSQL: Document
  17. 17. NoSQL: Column Family • Cassandra, Bigtable etc. • Stores data in columns • Transparent values – Can be part of query • SQL-like queries • Suited for search
  18. 18. NoSQL: Graph • Neo4j • Stores data in form of nodes and relationships • Query is in form of traversal • In-memory • Suited for social graph
  19. 19. NoSQL: Graph
  20. 20. What is MongoDB?
  21. 21. MongoDB is a ___________ database 1. Document 2. Open source 3. High performance 4. Horizontally scalable 5. Full featured
  22. 22. 1. Document Database • Not for .PDF & .DOC files • A document is essentially an associative array • Document = JSON object • Document = PHP Array • Document = Python Dict • Document = Ruby Hash • etc.
  23. 23. Database Landscape
  24. 24. 2. Open Source • MongoDB is an open source project • On GitHub • Licensed under the AGPL • Started & sponsored by MongoDB Inc (formerly known as 10gen) • Commercial licenses available • Contributions welcome
  25. 25. Global Community • 7,000,000+ MongoDB Downloads • 150,000+ Online Education Registrants • 35,000+ MongoDB Management Service (MMS) Users • 30,000+ MongoDB User Group Members • 20,000+ MongoDB Days Attendees
  26. 26. 3. High Performance • Written in C++ • Extensive use of memory-mapped files i.e. read-through write-through memory caching. • Runs nearly everywhere • Data serialized as BSON (fast parsing) • Full support for primary & secondary indexes • Document model = less work
  27. 27. Performance • Better Data Locality • In-Memory Caching • In-Place Updates
  28. 28. 4. Scalability Auto-Sharding • Increase capacity as you go • Commodity and cloud architectures • Improved operational simplicity and cost visibility
  29. 29. High Availability • Automated replication and failover • Multi-data center support • Improved operational simplicity (e.g., HW swaps) • Data durability and consistency
  30. 30. Scalability: MongoDB Architecture
  31. 31. 5. Full Featured • Ad Hoc queries • Real time aggregation • Rich query capabilities • Strongly consistent • Geospatial features • Support for most programming languages • Flexible schema
  32. 32. MongoDB is Fully Featured
  33. 33. MongoDB Architecture
  34. 34. Terminology
  35. 35. Do More With Your Data • Rich Queries – Find Paul's cars – Find everybody in London with a car built between 1970 and 1980 • Geospatial – Find all of the car owners within 5 km of Trafalgar Sq. • Text Search – Find all the cars described as having leather seats • Aggregation – Calculate the average value of Paul's car collection • Map Reduce – What is the ownership pattern of colors by geography over time? (Is purple trending up in China?)
        {
          first_name: 'Paul',
          surname: 'Miller',
          city: 'London',
          location: [45.123, 47.232],
          cars: [
            { model: 'Bentley', year: 1973, value: 100000, … },
            { model: 'Rolls Royce', year: 1965, value: 330000, … }
          ]
        }
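The query types on slide 35 can be tried out without a running server. What follows is only a plain-Python sketch over a list of dicts shaped like the slide's document. The second owner ("Anna") and all helper names are made up for illustration, and a real application would send equivalent filters to MongoDB through a driver rather than loop in Python.

```python
# Plain-Python illustration of the slide's queries (no MongoDB server needed).
# The "Anna" document and the helper names are hypothetical.
people = [
    {"first_name": "Paul", "surname": "Miller", "city": "London",
     "cars": [{"model": "Bentley", "year": 1973, "value": 100000},
              {"model": "Rolls Royce", "year": 1965, "value": 330000}]},
    {"first_name": "Anna", "surname": "Smith", "city": "Paris",
     "cars": [{"model": "Mini", "year": 1975, "value": 20000}]},
]


def cars_of(first_name):
    """Rich query: all cars owned by people with the given first name."""
    return [car for p in people if p["first_name"] == first_name
            for car in p["cars"]]


def owners_in_city_with_car_between(city, start, end):
    """Rich query: owners in a city with at least one car built in [start, end]."""
    return [p["first_name"] for p in people
            if p["city"] == city
            and any(start <= car["year"] <= end for car in p["cars"])]


def average_car_value(first_name):
    """Aggregation: average value of one owner's car collection."""
    values = [car["value"] for car in cars_of(first_name)]
    return sum(values) / len(values)


paul_models = [car["model"] for car in cars_of("Paul")]
london_owners = owners_in_city_with_car_between("London", 1970, 1980)
paul_average = average_car_value("Paul")
print(paul_models, london_owners, paul_average)
```

Because the cars are embedded in the owner document, each question is answered from a single document per owner, which is the point the slide makes about the document model.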
  36. 36. Hands-On & Assignment
  37. 37. mongodb.org/downloads
  38. 38. Running MongoDB
        $ tar -zxvf mongodb-osx-x86_64-2.6.0.tgz
        $ cd mongodb-osx-x86_64-2.6.0/bin
        $ mkdir -p /data/db
        $ ./mongod
  39. 39. MongoDB: Core Binaries • mongod – Database server • mongo – Database client shell • mongos – Router for Sharding
  40. 40. Getting Help • For mongo shell – mongo --help • Shows options available for running the shell • Inside mongo shell – db.help() • Shows commands available on the object
  41. 41. Database Operations • Database creation • Creating/changing collection • Data insertion • Data read • Data update • Creating indices • Data deletion • Dropping collection
  42. 42. Mongo Shell
        MacBook-Pro-:~ $ mongo
        MongoDB shell version: 2.6.0
        connecting to: test
        > db.cms.insert({text: 'Welcome to MongoDB'})
        > db.cms.find().pretty()
        {
            "_id" : ObjectId("51c34130fbd5d7261b4cdb55"),
            "text" : "Welcome to MongoDB"
        }
  43. 43. Diagnostic Tools • mongostat • mongoperf • mongosniff • mongotop
  44. 44. Import Export Tools • For objects – mongodump – mongorestore – bsondump – mongooplog • For data items – mongoimport – mongoexport
  45. 45. Assignment • Tasks – assignments.txt • Data – students.json
  46. 46. Questions?
  47. 47. Sarang Shravagi @_sarangs Thank You
  48. 48. Design Models
  49. 49. The first step in any application is to determine your entities
  50. 50. Entities in our Blogging System • Users (post authors) • Article • Comments • Tags, Category • Interactions (views, clicks)
  51. 51. In a relational-based app, we would start by doing schema design
  52. 52. Typical (relational) ERD
  53. 53. In a MongoDB-based app, we start building our app and let the schema evolve
  54. 54. MongoDB ERD
  55. 55. Disk seeks and data locality • Seek = 5+ ms • Read = really really fast • [Diagram: Post, Author, Comment]
  56. 56. Disk seeks and data locality • [Diagram: Post, Author and five Comments]
  57. 57. MongoDB Language Driver
  58. 58. Real applications are not built in the shell
  59. 59. MongoDB has native bindings for over 12 languages
  60. 60. Drivers & Ecosystem • Drivers – Support for the most popular languages and frameworks (Java, Python, Perl, Ruby) • Frameworks – Morphia, MEAN Stack
  61. 61. Working With MongoDB
  62. 62. Design schema... in application code
        # Python dictionary (or object)
        >>> from datetime import datetime
        >>> article = {
        ...     'title': 'Schema design in MongoDB',
        ...     'author': 'sarangs',
        ...     'section': 'schema',
        ...     'slug': 'schema-design-in-mongodb',
        ...     'text': "Data in MongoDB has a flexible schema. So, two documents needn't have the same structure. It allows an implicit schema to evolve.",
        ...     'date': datetime.utcnow(),
        ...     'tags': ['MongoDB', 'schema'],
        ... }
        >>> db['articles'].insert(article)  # insert_one() in PyMongo 3 and later
  63. 63. Let's add a headline image
        >>> from bson.binary import Binary
        >>> # open in binary mode so the image bytes are read unchanged
        >>> img_data = Binary(open('article_img.jpg', 'rb').read())
        >>> article = {
        ...     'title': 'Schema evolution in MongoDB',
        ...     'author': 'mattbates',
        ...     'section': 'schema',
        ...     'slug': 'schema-evolution-in-mongodb',
        ...     'text': 'MongoDB has dynamic schema. For good performance, you would need an implicit structure and indexes',
        ...     'date': datetime.utcnow(),
        ...     'tags': ['MongoDB', 'schema', 'migration'],
        ...     'headline_img': {
        ...         'img': img_data,
        ...         'caption': 'A sample document at the shell',
        ...     },
        ... }
  64. 64. And different types of article
        >>> article = {
        ...     'title': 'Favourite web application framework',
        ...     'author': 'sarangs',
        ...     'section': 'web-dev',
        ...     'slug': 'web-app-frameworks',
        ...     'gallery': [
        ...         { 'img_url': 'http://x.com/45rty', 'caption': 'Flask', .. },
        ...         ..
        ...     ],
        ...     'date': datetime.utcnow(),
        ...     'tags': ['Python', 'web'],
        ... }
        >>> db['articles'].insert(article)
  65. 65. Users and profiles
        >>> user = {
        ...     'user': 'sarangs',
        ...     'email': 'sarang.shravagi@gmail.com',
        ...     'password': 'sarang',
        ...     'joined': datetime.utcnow(),
        ...     'location': {'city': 'Mumbai'},
        ... }
        >>> db['users'].insert(user)
  66. 66. Modelling comments (1) • Two collections – articles and comments • Use a reference (i.e. foreign key) to link together • But.. N+1 queries to retrieve article and comments
        Article document:
        {
          '_id': ObjectId(..),
          'title': 'Schema design in MongoDB',
          'author': 'mattbates',
          'date': ISODate(..),
          'tags': ['MongoDB', 'schema'],
          'section': 'schema',
          'slug': 'schema-design-in-mongodb',
          'comments': [ ObjectId(..), … ]
        }
        Comment document:
        {
          '_id': ObjectId(..),
          'article_id': ObjectId(..),
          'text': 'A great article, helped me understand schema design',
          'date': ISODate(..),
          'author': 'johnsmith'
        }
  67. 67. Modelling comments (2) • Single articles collection – embed comments in article documents • Pros – Single query, document designed for the access pattern – Locality (disk, shard) • Cons – Comments array is unbounded; documents will grow in size (remember 16MB document limit)
        {
          '_id': ObjectId(..),
          'title': 'Schema design in MongoDB',
          'author': 'mattbates',
          'date': ISODate(..),
          'tags': ['MongoDB', 'schema'],
          …
          'comments': [
            { 'text': 'A great article, helped me understand schema design', 'date': ISODate(..), 'author': 'johnsmith' },
            …
          ]
        }
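The difference between models (1) and (2) is the number of round trips needed to show an article with its comments. The sketch below counts lookups against a dict-backed stand-in for a collection. `FakeCollection`, the integer ids and the second comment are hypothetical and exist only to make the count visible; they are not part of any driver.

```python
class FakeCollection:
    """Dict-backed stand-in for a collection that counts lookups (hypothetical)."""

    def __init__(self, docs):
        self.docs = {doc["_id"]: doc for doc in docs}
        self.queries = 0

    def find_one(self, _id):
        self.queries += 1
        return self.docs[_id]


comment_docs = [
    {"_id": 101, "article_id": 1, "text": "A great article", "author": "johnsmith"},
    {"_id": 102, "article_id": 1, "text": "Very helpful", "author": "janedoe"},
]

# Model (1): the article only holds references into a separate comments collection.
articles_ref = FakeCollection(
    [{"_id": 1, "title": "Schema design in MongoDB", "comments": [101, 102]}])
comments = FakeCollection(comment_docs)

article = articles_ref.find_one(1)
referenced = [comments.find_one(cid)["text"] for cid in article["comments"]]
queries_referenced = articles_ref.queries + comments.queries  # 1 + N lookups

# Model (2): the comments are embedded in the article document itself.
articles_emb = FakeCollection(
    [{"_id": 1, "title": "Schema design in MongoDB",
      "comments": [{"text": c["text"], "author": c["author"]} for c in comment_docs]}])

article = articles_emb.find_one(1)
embedded = [c["text"] for c in article["comments"]]
queries_embedded = articles_emb.queries  # a single lookup

print(queries_referenced, queries_embedded)
```

With a real driver the referenced model could fetch all comments in one extra `$in` query instead of N, but it still needs more than the single read that the embedded model gets away with.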
  68. 68. Modelling comments (3) • Another option: hybrid of (1) and (2), embed top x comments (e.g. by date, popularity) into the article document • Fixed-size (2.4 feature) comments array • All other comments 'overflow' into a comments collection (double write) in buckets • Pros – Document size is more fixed – fewer moves – Single query built – Full comment history with rich query/aggregation
  69. 69. Modelling comments (3)
        {
          '_id': ObjectId(..),
          'title': 'Schema design in MongoDB',
          'author': 'mattbates',
          'date': ISODate(..),
          'tags': ['MongoDB', 'schema'],
          …
          'comments_count': 45,
          'comments_pages': 1,
          'comments': [
            { 'text': 'A great article, helped me understand schema design', 'date': ISODate(..), 'author': 'johnsmith' },
            …
          ]
        }
        • comments_count: total number of comments – integer counter updated by update operation as comments are added/removed
        • comments_pages: number of pages – a page is a bucket of 100 comments (see next slide)
        • comments: fixed-size comments array – 10 most recent, sorted by date on insertion
  70. 70. Modelling comments (3) • One comment bucket (page) document containing up to about 100 comments
        {
          '_id': ObjectId(..),
          'article_id': ObjectId(..),
          'page': 1,
          'count': 42,
          'comments': [
            { 'text': 'A great article, helped me understand schema design', 'date': ISODate(..), 'author': 'johnsmith' },
            …
          ]
        }
        • comments: array of up to 100 comment sub-documents
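The bucketing rule behind these page documents is integer division of a running comment counter by the bucket size. A minimal in-memory sketch follows, assuming 0-based page numbers and a plain dict in place of the comments collection. The function and variable names are illustrative, not taken from the workshop code.

```python
BUCKET_SIZE = 100  # comments per page document, as on the slide


def add_comment(article, buckets, comment):
    """Append a comment to the bucket its sequence number falls into."""
    page = article["comments_count"] // BUCKET_SIZE  # 0-based page (assumption)
    bucket = buckets.setdefault(
        page, {"page": page, "count": 0, "comments": []})
    bucket["comments"].append(comment)
    bucket["count"] += 1
    article["comments_count"] += 1
    article["comments_pages"] = len(buckets)
    return page


article = {"title": "Schema design in MongoDB",
           "comments_count": 0, "comments_pages": 0}
buckets = {}  # stands in for the comments collection, keyed by page number

for n in range(245):
    add_comment(article, buckets, {"text": "comment {}".format(n)})

bucket_sizes = [buckets[page]["count"] for page in sorted(buckets)]
print(article["comments_pages"], bucket_sizes)
```

Every bucket except the last is full, so reading "page k" of the comments is one lookup by `article_id` and `page`, and no document grows past the bucket size.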
  71. 71. Modelling interactions • Interactions – Article views – Comments – (Social media sharing) • Requirements – Time series – Pre-aggregated in preparation for analytics
  72. 72. Modelling interactions • Document per article per day – 'bucketing' • Daily counter and hourly sub-document counters for interactions • Bounded array (24 hours) • Single query to retrieve daily article interactions; ready-made for graphing and further aggregation
        {
          '_id': ObjectId(..),
          'article_id': ObjectId(..),
          'section': 'schema',
          'date': ISODate(..),
          'daily': { 'views': 45, 'comments': 150 },
          'hourly': {
            '0': { 'views': 10 },
            '1': { 'views': 2 },
            …
            '23': { 'comments': 14, 'views': 10 }
          }
        }
  73. 73. JSON and RESTful API • Real applications are not built at a shell – let's build a RESTful API • [Diagram labels: client-side JSON (e.g. AngularJS), HTTP(S) REST, Python web app, PyMongo driver, BSON] • Examples to follow: Python RESTful API using Flask microframework
  74. 74. myCMS REST endpoints
        Method  URI                              Action
        GET     /articles                        Retrieve all articles
        GET     /articles-by-tag/[tag]           Retrieve all articles by tag
        GET     /articles/[article_id]           Retrieve a specific article by article_id
        POST    /articles                        Add a new article
        GET     /articles/[article_id]/comments  Retrieve all article comments by article_id
        POST    /articles/[article_id]/comments  Add a new comment to an article
        POST    /users                           Register a user
        GET     /users/[username]                Retrieve user's profile
        PUT     /users/[username]                Update a user's profile
  75. 75. Getting started with the skeleton code
        $ git clone http://www.github.com/mattbates/mycms_mongodb
        $ cd mycms_mongodb
        $ virtualenv venv
        $ source venv/bin/activate
        $ pip install -r requirements.txt
        $ mkdir -p data/db
        $ mongod --dbpath=data/db --fork --logpath=mongod.log
        $ python web.py
        [$ deactivate]
  76. 76. RESTful API methods in Python + Flask
        @app.route('/cms/api/v1.0/articles', methods=['GET'])
        def get_articles():
            """Retrieves all articles in the collection sorted by date"""
            # query all articles and return a cursor sorted by date
            cur = db['articles'].find().sort('date')
            if not cur:
                abort(400)
            # iterate the cursor and add docs to a list
            articles = [article for article in cur]
            return jsonify({'articles': json.dumps(articles, default=json_util.default)})
  77. 77. RESTful API methods in Python + Flask
        @app.route('/cms/api/v1.0/articles/<string:article_id>/comments', methods=['POST'])
        def add_comment(article_id):
            """Adds a comment to the specified article and a bucket, as well as updating a view counter"""
            …
            page_id = article['last_comment_id'] // 100
            …
            # push the comment to the latest bucket and $inc the count
            page = db['comments'].find_and_modify(
                {'article_id': ObjectId(article_id), 'page': page_id},
                {'$inc': {'count': 1}, '$push': {'comments': comment}},
                fields={'count': 1},
                upsert=True,
                new=True)
  78. 78. RESTful API methods in Python + Flask
            # $inc the page count if bucket size (100) is exceeded
            if page['count'] > 100:
                db.articles.update(
                    {'_id': ObjectId(article_id),
                     'comments_pages': article['comments_pages']},
                    {'$inc': {'comments_pages': 1}})
            # let's also add to the article itself
            # most recent 10 comments only
            res = db['articles'].update(
                {'_id': ObjectId(article_id)},
                {'$push': {'comments': {'$each': [comment],
                                        '$sort': {'date': 1},
                                        '$slice': -10}},
                 '$inc': {'comments_count': 1}})
            …
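The `$push` with `$each`, `$sort` and `$slice: -10` is what keeps the embedded array at the ten most recent comments. Its effect can be imitated on a plain Python list, as in the sketch below. `push_sorted_sliced` is a hypothetical helper that mirrors the three modifiers, and integer dates are used only to keep the example short.

```python
def push_sorted_sliced(array, new_items, sort_key, keep_last):
    """Imitate $push with $each, an ascending $sort and a negative $slice."""
    merged = array + list(new_items)        # $each: append every new item
    merged.sort(key=lambda doc: doc[sort_key])  # $sort: {'date': 1}
    return merged[-keep_last:]              # $slice: -N keeps the last N


comments = []
# insert 12 comments out of date order; dates are small integers for brevity
for date in [5, 1, 9, 3, 12, 7, 2, 11, 4, 8, 6, 10]:
    comments = push_sorted_sliced(
        comments, [{"date": date, "text": "comment"}],
        sort_key="date", keep_last=10)

kept_dates = [c["date"] for c in comments]
print(kept_dates)
```

The array never holds more than ten entries, so the article document stays bounded in size while the full history lives in the bucket documents.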
  79. 79. RESTful API methods in Python + Flask
        def add_interaction(article_id, type):
            """Record the interaction (view/comment) for the specified article into the daily bucket and update an hourly counter"""
            ts = datetime.datetime.utcnow()
            # $inc daily and hourly view counters in day/article stats bucket
            # note the unacknowledged w=0 write concern for performance
            db['interactions'].update(
                {'article_id': ObjectId(article_id),
                 'date': datetime.datetime(ts.year, ts.month, ts.day)},
                {'$inc': {'daily.{}'.format(type): 1,
                          'hourly.{}.{}'.format(ts.hour, type): 1}},
                upsert=True, w=0)
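To see what that upsert produces, it helps to build the filter and `$inc` document separately and apply them to an empty dict. The sketch below does this without a server. `interaction_update` and `apply_inc` are hypothetical helpers, and `apply_inc` only imitates how `$inc` treats dotted paths (missing levels are created and missing counters start at 0).

```python
from datetime import datetime


def interaction_update(article_id, kind, ts):
    """Build the (filter, update) pair for one interaction at time ts."""
    day = datetime(ts.year, ts.month, ts.day)  # one bucket per article per day
    query = {"article_id": article_id, "date": day}
    update = {"$inc": {"daily.{}".format(kind): 1,
                       "hourly.{}.{}".format(ts.hour, kind): 1}}
    return query, update


def apply_inc(doc, update):
    """Apply a $inc with dotted paths to a nested dict, creating missing levels."""
    for path, amount in update["$inc"].items():
        *parents, leaf = path.split(".")
        node = doc
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = node.get(leaf, 0) + amount
    return doc


ts = datetime(2014, 4, 10, 16, 0, 51)
bucket = {}
for kind in ["views", "views", "comments"]:
    query, update = interaction_update("article-1", kind, ts)
    apply_inc(bucket, update)

print(query)
print(bucket)
```

All three interactions land in the same day bucket, and the hour appears as a string key because field names in a document are always strings.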
  80. 80. Testing the API – retrieve articles
        $ curl -i http://localhost:5000/cms/api/v1.0/articles
        HTTP/1.0 200 OK
        Content-Type: application/json
        Content-Length: 335
        Server: Werkzeug/0.9.4 Python/2.7.5
        Date: Thu, 10 Apr 2014 16:00:51 GMT
        { "articles": "[{\"title\": \"Schema design in MongoDB\", \"text\": \"Data in MongoDB has a flexible schema..\", \"section\": \"schema\", \"author\": \"sarangs\", \"date\": {\"$date\": 1397145312505}, \"_id\": {\"$oid\": \"5346bef5f2610c064a36a793\"}, \"slug\": \"schema-design-in-mongodb\", \"tags\": [\"MongoDB\", \"schema\"]}]" }
  81. 81. Testing the API – comment on an article
        $ curl -H "Content-Type: application/json" -X POST -d '{"text":"An interesting article and a great read."}' http://localhost:5000/cms/api/v1.0/articles/52ed73a30bd031362b3c6bb3/comments
        { "comment": "{\"date\": {\"$date\": 1391639269724}, \"text\": \"An interesting article and a great read.\"}" }
  82. 82. Disaster Recovery Introduction to Replica Sets and High Availability
  83. 83. Disasters • Physical Failure – Hardware – Network • Solution – Replica Sets • Provide redundant storage for High Availability – Real time data synchronization • Automatic failover for zero down time
  84. 84. Replication
  85. 85. Multi Replication • Data can be replicated to multiple places simultaneously • A replica set should have an odd number of voting members so that elections can always reach a majority
  86. 86. Single Replication • If you want only one secondary (or any odd number of secondaries), the set has an even number of data-bearing members, so you need to set up an arbiter to break ties
  87. 87. Failover • When the primary fails, the remaining members vote to elect a new primary
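The advice about odd member counts and arbiters follows from the election rule: a new primary needs votes from a strict majority of the voting members. The arithmetic can be sketched as below; the function names are illustrative.

```python
def majority(voting_members):
    """Votes needed to elect a primary: a strict majority of voting members."""
    return voting_members // 2 + 1


def can_elect_primary(voting_members, reachable_members):
    """True if the members that can still see each other form a majority."""
    return reachable_members >= majority(voting_members)


def fault_tolerance(voting_members):
    """How many members may fail while a primary can still be elected."""
    return voting_members - majority(voting_members)


for n in (2, 3, 4, 5):
    print(n, "members: majority", majority(n), "tolerates", fault_tolerance(n), "failure(s)")
```

A fourth member raises the required majority without letting the set survive any more failures than three members do, and a two-member set cannot survive any failure at all. That is why a primary with a single secondary is paired with an arbiter, which votes but stores no data.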
  88. 88. Handling Big Data Introduction to Map/Reduce and Sharding
  89. 89. Large Data Sets • Problem 1 – Performance • Queries become slow • Solution – Map/Reduce
  90. 90. Aggregation
  91. 91. Map Reduce • A way to divide large query computation into smaller chunks • May run in multiple processes across multiple machines • Think of it as SQL's GROUP BY
  92. 92. Map/Reduce Example • The map function digs through the data and returns the required values
  93. 93. Map/Reduce Example • The reduce function takes the output of the map function and generates an aggregated value
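The example slides are images in the original deck, so here is a stand-in written in plain Python rather than MongoDB's JavaScript map/reduce. It counts articles per tag, which is the GROUP BY analogy from slide 91. The tiny `map_reduce` runner and the sample articles are assumptions made for illustration.

```python
from collections import defaultdict

articles = [
    {"title": "Schema design in MongoDB", "tags": ["MongoDB", "schema"]},
    {"title": "Schema evolution in MongoDB", "tags": ["MongoDB", "schema", "migration"]},
    {"title": "Favourite web application framework", "tags": ["Python", "web"]},
]


def map_article(doc):
    """Map: emit one (tag, 1) pair for every tag of the document."""
    for tag in doc["tags"]:
        yield tag, 1


def reduce_counts(key, values):
    """Reduce: collapse all values emitted for one key into a single total."""
    return sum(values)


def map_reduce(docs, mapper, reducer):
    """Group the mapper's output by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            grouped[key].append(value)
    return {key: reducer(key, values) for key, values in grouped.items()}


tag_counts = map_reduce(articles, map_article, reduce_counts)
print(tag_counts)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over several processes or machines, which is what makes the pattern useful for large data sets.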
  94. 94. Large Data Sets • Problem 2 – Vertical Scaling of Hardware • Can’t increase machine size beyond a limit • Solution – Sharding
  95. 95. Sharding • A method for storing data across multiple machines • Data is partitioned using Shard Keys
  96. 96. Data Partitioning: Range Based • A range of Shard Keys stays in a chunk
  97. 97. Data Partitioning: Hash Based • A hash function on Shard Keys decides the chunk
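The two partitioning schemes differ only in how a shard key is turned into a chunk number. The sketch below shows both on string keys. The chunk boundaries, the chunk count and the use of SHA-256 are arbitrary choices for illustration and are not how MongoDB itself lays out or hashes chunks.

```python
import bisect
import hashlib

# Hypothetical split points giving 4 chunks: [min, g) [g, n) [n, t) [t, max)
RANGE_BOUNDARIES = ["g", "n", "t"]


def range_chunk(shard_key):
    """Range-based: neighbouring keys fall into the same chunk."""
    return bisect.bisect_right(RANGE_BOUNDARIES, shard_key)


def hash_chunk(shard_key, num_chunks=4):
    """Hash-based: a hash of the key decides the chunk, scattering neighbours."""
    digest = hashlib.sha256(shard_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_chunks


usernames = ["alice", "bob", "mattbates", "sarangs", "zed"]
range_placement = {name: range_chunk(name) for name in usernames}
hash_placement = {name: hash_chunk(name) for name in usernames}
print(range_placement)
print(hash_placement)
```

Range placement keeps sorted keys together, which suits range queries but can send a run of similar keys to one chunk. Hash placement tends to spread writes more evenly, at the cost of range queries having to visit many chunks.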
  98. 98. Sharded Cluster
  99. 99. Optimizing Shards: Splitting • In a shard, when a chunk grows beyond the configured chunk size, the chunk is split into two
  100. 100. Optimizing Shards: Balancing • When one shard holds many more chunks than the others, a few chunks are migrated to other shards
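Splitting and balancing can be pictured with lists standing in for chunks and shards. This is a deliberately simplified sketch: the threshold counts documents rather than bytes, and the balancing rule (even out chunk counts to within one) is an assumption, not MongoDB's actual migration policy.

```python
MAX_CHUNK_DOCS = 4  # hypothetical threshold; real chunk limits are sizes in bytes


def split_if_needed(chunk):
    """Split a sorted chunk in two at its middle key once it grows too large."""
    if len(chunk) <= MAX_CHUNK_DOCS:
        return [chunk]
    mid = len(chunk) // 2
    return [chunk[:mid], chunk[mid:]]


def balance(shards):
    """Move chunks from the fullest shard to the emptiest until counts differ by at most 1."""
    while True:
        fullest = max(shards, key=lambda name: len(shards[name]))
        emptiest = min(shards, key=lambda name: len(shards[name]))
        if len(shards[fullest]) - len(shards[emptiest]) <= 1:
            return shards
        shards[emptiest].append(shards[fullest].pop())


halves = split_if_needed([1, 2, 3, 4, 5, 6])

shards = {"shard_a": [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], "shard_b": []}
balance(shards)
chunk_counts = {name: len(chunks) for name, chunks in shards.items()}
print(halves, chunk_counts)
```

Splitting only changes chunk boundaries within a shard, while balancing actually moves data between machines, which is why the two are treated as separate steps on the slides.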
  101. 101. Schema iteration • New feature in the backlog? Documents have dynamic schema, so we just iterate the object schema.
        >>> user = {
        ...     'username': 'matt',
        ...     'first': 'Matt',
        ...     'last': 'Bates',
        ...     'preferences': {'opt_out': True},
        ... }
        >>> db['users'].save(user)
  102. 102. docs.mongodb.org
  103. 103. Online Training at MongoDB University
  104. 104. For More Information
        Resource               Location
        MongoDB Downloads      mongodb.com/download
        Free Online Training   education.mongodb.com
        Webinars and Events    mongodb.com/events
        White Papers           mongodb.com/white-papers
        Case Studies           mongodb.com/customers
        Presentations          mongodb.com/presentations
        Documentation          docs.mongodb.org
        Additional Info        info@mongodb.com
  105. 105. We've introduced a lot of concepts here
  106. 106. Schema Design @
  107. 107. Replication @
  108. 108. Indexing @
  109. 109. Sharding @
  110. 110. Questions?
  111. 111. Sarang Shravagi @_sarangs Thank You
