Open source, high performance databaseIntroduction to NoSQL and MongoDBWill LaForestSenior Director of 10gen Federalwill@1...
SQL                              Dynamic                   invented                          Web Content                  ...
AttributeTuple                    Relation                               3
4
• Data stored in a RDBMS is very compact (disk was  more expensive)• SQL and RDBMS made queries flexible with rigid  schem...
• Gartner uses the 3Vs to define• Volume - Big/difficult/extreme volume is relative• Variety   –   Changing or evolving da...
VOLUME & NEW                             ARCHITECTURES                             • Systems scaling horizontally,        ...
• Non-relational been hanging around (MUMPS?)• Modern NoSQL theory and offerings started in early  2000s• Modern usage of ...
• Value (Data) mapped to a key (think primary)• Some typed some just BLOBs• Redis, MemcachDB, Voldemort        “Will”     ...
•   Data stored on disk in a column oriented fashion•   Predominantly hash based indexing•   Data partitioned by range or ...
• Key-Value Stores   – Key-Value Stores   – Value (Data) mapped to a key (think primary)• Big Table Descendants   – Looks ...
• Not a database• Map reduce on HDFS or other data source• When you want to use it  – Can’t use a index  – Distributing cu...
13
• 2007  – Eliot Horowitz & Dwight Merriman tired of reinventing the    wheel  – 10gen founded  – MongoDB Development begin...
• Pre-production Subscriptions for MongoDB   – Fixed cost = $30k   – 6 month term, unlimited servers• MongoDB Subscription...
• 4 servers with 2 processors each• Total processors equals 8• Oracle Enterprise Edition   – $47,500 per processor plus ma...
#2 on Indeed’s Fastest Growing Jobs        Jaspersoft BigData Index                                                       ...
#2 ON INDEED’S FASTEST GROWING JOBS                                      18
“MongoDB INCREASING ITS DOMINANCE”                                     19
• Scale horizontally over commodity hardware• RDBMSs great so keep what works   – Rich data models   – Adhoc queries   – F...
•   Data stored as documents (JSON)•   Schema free•   CRUD operations – (Create Read Update Delete)•   Atomic document ope...
•   MongoDB does not need any defined data schema.•   Every document could have different data!    {name: “will”,         ...
Seek = 5+ ms          Read = really really fast               Post                           CommentAuthor                ...
Post  Author  Comment  Comment   Comment    Comment    Comment              24
RDBMS      MongoDBDatabase   DatabaseTable      CollectionRow        Document                        25
• We are running instances on the order of:  -   100B objects  -   50TB storage  -   50K qps per server  -   ~1200 servers...
• SSL   – between client and server   – Intra-cluster communication• Authorization at the database level   – Read Only/Rea...
28
var p = { author: “roger”,     date: new Date(),     text: “Spirited Away”,     tags: *“Tezuka”, “Manga”+-> db.posts.save(...
>db.posts.find() { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),   author : "roger",   date : "Sat Jul 24 2010 19:47:11 GMT-...
Create index on any Field in Document // 1 means ascending, -1 means descending >db.posts.ensureIndex({author: 1}) >db.pos...
• Conditional Operators  – $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type  – $lt, $lte, $gt, $gte  // find p...
• $set, $unset, $inc, $push, $pushAll, $pull, $pullAll, $bit> comment = { author: “fred”,           date: new Date(),     ...
{ _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),  author : "roger",  date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",  text ...
// Index nested documents> db.posts.ensureIndex( “comments.author”:1 ) db.posts.find(,‘comments.author’:’Fred’-)// Index ...
36
•   Native Map/Reduce in JS in MongoDB    –   Distributes across the cluster with good data locality•   New aggregation fr...
38
$project                 $match                  $limit              $skip   $unwind                  $group              ...
40
Input data   MAP   Intermediate data   REDUCE   Output data              1                          1              2      ...
Input data   MAP   Intermediate data   REDUCE   Output data              1                          1              2      ...
43
Write                 Primary         Read                             Asynchronous                 Secondary   Replicatio...
Primary                Secondary         ReadDriver                Secondary         Read                            45
Primary         Write                 Primary     Automatic         Read                Leader ElectionDriver             ...
Secondary         Read         Write         Read    PrimaryDriver                 Secondary         Read                 ...
Additional Details                           Clients            MongoS         MongoS               MongoS                ...
•   Sharding Details•   Replica Set Details•   Consistency Details•   Common Deployment Scenarios•   Citations            ...
50
Write                 Primary                 Secondary         Read                 SecondaryDriver         Read         ...
Write                 PrimaryDriver         Read                 Secondary                 Secondary                      ...
1. Write                    Primary                    Secondary   1. Replicate         2. Read                    Seconda...
• Fire and forget• Wait for error• Wait for fsync• Wait for journal sync• Wait for replication                          54
Driver           Primary         write                           apply in memory                                          ...
Driver             Primary            write         getLastError                             apply in memory              ...
Driver              Primary            write         getLastError                              apply in memory           j...
Driver             Primary            write         getLastError                             apply in memory          fsyn...
Driver             Primary                     Secondary            write         getLastError                            ...
Value          Meaning<n:integer>    Replicate to N members of               replica set“majority”     Replicate to a majo...
61
> db.runCommand( { shardcollection: “test.users”,          key: { email: 1 }} )      {          name: “Jared”,          em...
-∞   +∞      63
-∞                                              +∞     dan@10gen.com            scott@10gen.com              jsr@10gen.com...
Split!-∞                                              +∞     dan@10gen.com            scott@10gen.com              jsr@10g...
This is a                   Split!        This is a     chunk                                     chunk-∞                 ...
-∞                                              +∞     dan@10gen.com            scott@10gen.com              jsr@10gen.com...
Split!-∞                                              +∞     dan@10gen.com            scott@10gen.com              jsr@10g...
-∞                adam@10gen.com    1adam@10gen.com    jared@10gen.com   1jared@10gen.com   scott@10gen.com   1scott@10gen...
mongos                                                                          config                                    ...
mongos                                                                          config                                    ...
mongos                                                                          config                                    ...
mongos                                                                         config                                    b...
mongos                                                                         config                                    b...
75
1             1.Query arrives at                                           mongos                                4        ...
1               1.Query arrives at                                               mongos                                 4 ...
1               1.Query arrives at mongos                                   6           2.mongos broadcasts query         ...
Inserts   Requires shard   db.users.insert({          key                name: “Jared”,                             email:...
By Shard      Routed            db.users.find(                                  {email: “jsr@10gen.com”})KeySorted by     ...
81
Data Center     Primary   Secondary   Secondary                                       82
Data Center     Primary   Secondary   Secondary                           hidden=true                            backups  ...
Active Data Center                  Standby Data Center  Primary            Secondary          Secondary  priority = 1    ...
West Coast DC   Central DC      East Coast DC Secondary       Primary           Secondary                 priority = 1    ...
86
•   History of Database Management (http://bit.ly/w3r0dv)•   EMC IDC Study (http://bit.ly/y1mJgJ)•   Gartner & Big Data (h...
88
• Impossible for a distributed computer system to  simultaneously provide all three of the following  guarantees   – Consi...
Upcoming SlideShare
Loading in...5
×

An Introduction to Big Data, NoSQL and MongoDB

9,306

Published on

Published in: Technology
0 Comments
16 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
9,306
On Slideshare
0
From Embeds
0
Number of Embeds
6
Actions
Shares
0
Downloads
385
Comments
0
Likes
16
Embeds 0
No embeds

No notes for slide
  • Database requirements are changing … because of i) volume ii) Type of data iii) Agile Development, iv) New architectures.. V) New Apps
  • Detailed explanation of sharding in optional slides
  • Transcript of "An Introduction to Big Data, NoSQL and MongoDB"

    1. 1. Open source, high performance databaseIntroduction to NoSQL and MongoDBWill LaForestSenior Director of 10gen Federalwill@10gen.com@WLaForest 1
    2. 2. SQL Dynamic invented Web Content released 10gen Web applications founded Oracle ClientIBM’s IMS founded Server SOA BigTable1965 1970 1975 1980 1985 1990 1995 2000 2005 2010 Codd publishes PC’s gain 3 tier Cloud relational model paper traction architecture Computing in 1970 Brewer’s Cap NoSQL WWW born Movement born 2
    3. 3. AttributeTuple Relation 3
    4. 4. 4
    5. 5. • Data stored in a RDBMS is very compact (disk was more expensive)• SQL and RDBMS made queries flexible with rigid schemas• Rigid schemas helps optimize joins and storage• Massive ecosystem of tools, libraries and integrations• Been around 40 years! 5
    6. 6. • Gartner uses the 3Vs to define• Volume - Big/difficult/extreme volume is relative• Variety – Changing or evolving data – Uncontrolled formats – Does not easily adhere to a single schema – Unknown at design time• Velocity – High or volatile inbound data – High query and read operations – Low latency 6
    7. 7. VOLUME & NEW ARCHITECTURES • Systems scaling horizontally, not vertically • Commodity servers • Cloud ComputingDATA VARIETY &VOLATILITY• Extremely difficult to find a single fixed schema• Don’t know data schema a-priori TRANSACTIONAL MODEL • N x Inserts or updates • Distributed transactions 7
    8. 8. • Non-relational been hanging around (MUMPS?)• Modern NoSQL theory and offerings started in early 2000s• Modern usage of term introduced in 2009• NoSQL = Not Only SQL• A collection of very different products• Alternatives to relational databases when they are a bad fit• Motives – Horizontally scalable (commodity server/cloud computing) – Flexibility 8
    9. 9. • Value (Data) mapped to a key (think primary)• Some typed some just BLOBs• Redis, MemcachDB, Voldemort “Will” • “@WLaForest” “Chris” • 12 “Robert” • BLOB 9
    10. 10. • Data stored on disk in a column oriented fashion• Predominantly hash based indexing• Data partitioned by range or consistent hashing• Google BigTable, HBase, Cassandra, Accumulo 10
    11. 11. • Key-Value Stores – Key-Value Stores – Value (Data) mapped to a key (think primary)• Big Table Descendants – Looks like a distributed multi-dimension map – Data stored in a column oriented fashion – Predominantly hash based indexing• Document oriented stores – Data stored as either JSON or XML documents• What do they all have in common? 11
    12. 12. • Not a database• Map reduce on HDFS or other data source• When you want to use it – Can’t use a index – Distributing custom algorithms – ETL• Great for grinding through data• Many NoSQL offerings have native map reduce functionality 12
    13. 13. 13
    14. 14. • 2007 – Eliot Horowitz & Dwight Merriman tired of reinventing the wheel – 10gen founded – MongoDB Development begins• 2009 – Initial release of MongoDB• 73M+ in funding – Funded by Sequoia, NEA, Union Square Ventures, Flybridge Capital 14
    15. 15. • Pre-production Subscriptions for MongoDB – Fixed cost = $30k – 6 month term, unlimited servers• MongoDB Subscriptions – 32 month term, $4k per server – 24x7 support for development and production, 1 hour response time – Onboarding call and quarterly review• Consulting – Cost is $300/hr• Training – MongoDB Developer - $1500/student, 2 day course – MongoDB Administrator - $1500/student, 2 day course – MongoDB Essential - $2250/student, 3 day course• MongoDB Monitoring Service 15
    16. 16. • 4 servers with 2 processors each• Total processors equals 8• Oracle Enterprise Edition – $47,500 per processor plus maintenance – 8 x 47,500 = $380,000 + $76,000 – Total Cost = $456,000• MongoDB Subscription – $4,000 per server (no processor count) – No license fee or maintenance charge – 4 x 4,000 = $16,000 – Total Cost = $16,000• MongoDB is $440,000 less expensive 16
    17. 17. #2 on Indeed’s Fastest Growing Jobs Jaspersoft BigData Index Demand for MongoDB, the document-oriented NoSQL database, saw the biggest spike with over 200% growth in 2011. 451 Group Google Searches “MongoDB increasing its dominance” 17
    18. 18. #2 ON INDEED’S FASTEST GROWING JOBS 18
    19. 19. “MongoDB INCREASING ITS DOMINANCE” 19
    20. 20. • Scale horizontally over commodity hardware• RDBMSs great so keep what works – Rich data models – Adhoc queries – Fully featured indexes• What doesn’t distribute well? – Long running multi-row transactions – Joins – Both artifacts of the relational data model• Do not homogenize programming interfaces• Local storage first class citizen for DB storage 20
    21. 21. • Data stored as documents (JSON)• Schema free• CRUD operations – (Create Read Update Delete)• Atomic document operations• Ad hoc Queries like SQL – Equality – Regular expression searches – Ranges – Geospatial• Secondary indexes• Sharding (sometimes called partitioning) for scalability• Replication – HA and read scalability 21
    22. 22. • MongoDB does not need any defined data schema.• Every document could have different data! {name: “will”, name: “jeff”, {name: “brendan”, eyes: “blue”, eyes: “blue”, aliases: [“el diablo”]} birthplace: “NY”, height: 72, aliases: [“bill”, “la boss: “ben”} ciacco”], {name: “matt”, gender: ”???”, pizza: “DiGiorno”, boss: ”ben”} name: “ben”, height: 72, hat: ”yes”} boss: 555.555.1212} 22
    23. 23. Seek = 5+ ms Read = really really fast Post CommentAuthor 23
    24. 24. Post Author Comment Comment Comment Comment Comment 24
    25. 25. RDBMS MongoDBDatabase DatabaseTable CollectionRow Document 25
    26. 26. • We are running instances on the order of: - 100B objects - 50TB storage - 50K qps per server - ~1200 servers 26
    27. 27. • SSL – between client and server – Intra-cluster communication• Authorization at the database level – Read Only/Read+Write/Administrator• Security Roadmap (tentative) – Pluggable authentication 2.4 – Auditing 2.4 – Cell level security 2.6 – Common Criteria certification 27
    28. 28. 28
    29. 29. var p = { author: “roger”, date: new Date(), text: “Spirited Away”, tags: *“Tezuka”, “Manga”+-> db.posts.save(p) 29
    30. 30. >db.posts.find() { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)", text : "Spirited Away", tags : [ "Tezuka", "Manga" ] }Notes: - _id is unique, but can be anything you’d like 30
    31. 31. Create index on any Field in Document // 1 means ascending, -1 means descending >db.posts.ensureIndex({author: 1}) >db.posts.find({author: roger}) { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", ... } 31
    32. 32. • Conditional Operators – $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type – $lt, $lte, $gt, $gte // find posts with any tags > db.posts.find( {tags: {$exists: true }} ) // find posts matching a regular expression > db.posts.find( {author: /^rog*/i } ) // count posts by author > db.posts.find( {author: ‘roger’- ).count() 32
    33. 33. • $set, $unset, $inc, $push, $pushAll, $pull, $pullAll, $bit> comment = { author: “fred”, date: new Date(), text: “Best Movie Ever”-> db.posts.update( { _id: “...” -, $push: {comments: comment} ); 33
    34. 34. { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)", text : "Spirited Away", tags : [ "Tezuka", "Manga" ], comments : [ { author : "Fred", date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)", text : "Best Movie Ever" } ]} 34
    35. 35. // Index nested documents> db.posts.ensureIndex( “comments.author”:1 ) db.posts.find(,‘comments.author’:’Fred’-)// Index on tags> db.posts.ensureIndex( tags: 1)> db.posts.find( { tags: ’Manga’ - )// geospatial index> db.posts.ensureIndex( “author.location”: “2d” )> db.posts.find( “author.location” : , $near : [22,42] } ) 35
    36. 36. 36
    37. 37. • Native Map/Reduce in JS in MongoDB – Distributes across the cluster with good data locality• New aggregation framework – Declarative (no JS required) – Pipeline approach (like Unix ps -ax | tee processes.txt | more)• Hadoop – Intersect the indexing of MongoDB with the brute force parallelization of hadoop – Hadoop MongoDB connector 37
    38. 38. 38
    39. 39. $project $match $limit $skip $unwind $group $sort{ db.article.aggregate( title : “this is my title” , { $project : { author : “bob” , author : 1, posted : new Date () , tags : 1, pageViews : 5 , }}, tags : [ “fun” , “good” , “fun” ] , { $unwind : "$tags" }, comments : [ { $group : { { author :“joe” , text : “this is cool” } , _id : “$tags”, { author :“sam” , text : “this is bad” } authors : { $addToSet : "$author" } ], }} other : { foo : 5 } );} 39
    40. 40. 40
    41. 41. Input data MAP Intermediate data REDUCE Output data 1 1 2 2 3 3 41
    42. 42. Input data MAP Intermediate data REDUCE Output data 1 1 2 2 3 3 42
    43. 43. 43
    44. 44. Write Primary Read Asynchronous Secondary Replication ReadDriver Secondary Read 44
    45. 45. Primary Secondary ReadDriver Secondary Read 45
    46. 46. Primary Write Primary Automatic Read Leader ElectionDriver Secondary Read 46
    47. 47. Secondary Read Write Read PrimaryDriver Secondary Read 47
    48. 48. Additional Details Clients MongoS MongoS MongoS Config ConfigKey Range Key Range Key Range Key Range0..30 31..60 61..90 91.. 100 Config Primary Primary Primary PrimarySecondary Secondary Secondary SecondarySecondary Secondary Secondary Secondary 48
    49. 49. • Sharding Details• Replica Set Details• Consistency Details• Common Deployment Scenarios• Citations 49
    50. 50. 50
    51. 51. Write Primary Secondary Read SecondaryDriver Read 51
    52. 52. Write PrimaryDriver Read Secondary Secondary 52
    53. 53. 1. Write Primary Secondary 1. Replicate 2. Read SecondaryDriver 2. Read 53
    54. 54. • Fire and forget• Wait for error• Wait for fsync• Wait for journal sync• Wait for replication 54
    55. 55. Driver Primary write apply in memory 55
    56. 56. Driver Primary write getLastError apply in memory 56
    57. 57. Driver Primary write getLastError apply in memory j:true Write to journal 57
    58. 58. Driver Primary write getLastError apply in memory fsync:true fsync 58
    59. 59. Driver Primary Secondary write getLastError apply in memory w:2 replicate 59
    60. 60. Value Meaning<n:integer> Replicate to N members of replica set“majority” Replicate to a majority of replica set members<m:modeName> Use cutom error mode name 60
    61. 61. 61
    62. 62. > db.runCommand( { shardcollection: “test.users”, key: { email: 1 }} ) { name: “Jared”, email: “jsr@10gen.com”, } { name: “Scott”, email: “scott@10gen.com”, } { name: “Dan”, email: “dan@10gen.com”, } 62
    63. 63. -∞ +∞ 63
    64. 64. -∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 64
    65. 65. Split!-∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 65
    66. 66. This is a Split! This is a chunk chunk-∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 66
    67. 67. -∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 67
    68. 68. Split!-∞ +∞ dan@10gen.com scott@10gen.com jsr@10gen.com 68
    69. 69. -∞ adam@10gen.com 1adam@10gen.com jared@10gen.com 1jared@10gen.com scott@10gen.com 1scott@10gen.com +∞ 1• Stored in the config serers• Cached in mongos• Used to route requests and keep cluster balanced 69
    70. 70. mongos config balancer configChunks! config 1 2 3 4 13 14 15 16 25 26 27 28 37 38 39 40 5 6 7 8 17 18 19 20 29 30 31 32 41 42 43 44 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48 Shard 1 Shard 2 Shard 3 Shard 4 70
    71. 71. mongos config balancer config Imbalance config 1 2 3 4 5 6 7 8 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48Shard 1 Shard 2 Shard 3 Shard 4 71
    72. 72. mongos config balancer config Move chunk 1 to config Shard 2 1 2 3 4 5 6 7 8 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48Shard 1 Shard 2 Shard 3 Shard 4 72
    73. 73. mongos config balancer config config 1 2 3 4 5 6 7 8 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48Shard 1 Shard 2 Shard 3 Shard 4 73
    74. 74. mongos config balancer config Chunks 1,2, and 3 have migrated config 4 5 6 7 8 1 2 3 9 10 11 12 21 22 23 24 33 34 35 36 45 46 47 48Shard 1 Shard 2 Shard 3 Shard 4 74
    75. 75. 75
    76. 76. 1 1.Query arrives at mongos 4 mongos 2.mongos routes query to a single shard 3.Shard returns results 2 of query 3 4.Results returned to clientShard 1 Shard 2 Shard 3 76
    77. 77. 1 1.Query arrives at mongos 4 mongos 2.mongos broadcasts query to all shards 3.Each shard returns 2 results for query 2 2 3 3 3 4.Results combined and returned to clientShard 1 Shard 2 Shard 3 77
    78. 78. 1 1.Query arrives at mongos 6 2.mongos broadcasts query mongos to all shards 5 3.Each shard locally sorts results 2 4.Results returned to 2 mongos 2 4 4 4 5.mongos merge sorts individual results 3 3 3 6.Combined sorted resultShard 1 Shard 2 Shard 3 returned to client 78
    79. 79. Inserts Requires shard db.users.insert({ key name: “Jared”, email: “jsr@10gen.com”})Removes Routed db.users.delete({ email: “jsr@10gen.com”}) Scattered db.users.delete({name: “Jared”})Updates Routed db.users.update( {email: “jsr@10gen.com”}, {$set: { state: “CA”}}) Scattered db.users.update( {state: “FZ”}, {$set:{ state: “CA”}} ) 79
    80. 80. By Shard Routed db.users.find( {email: “jsr@10gen.com”})KeySorted by Routed in order db.users.find().sort({email:-1})shard keyFind by non Scatter Gather db.users.find({state:”CA”})shard keySorted by Distributed merge db.users.find().sort({state:1}) sortnon shardkey 80
    81. 81. 81
    82. 82. Data Center Primary Secondary Secondary 82
    83. 83. Data Center Primary Secondary Secondary hidden=true backups 83
    84. 84. Active Data Center Standby Data Center Primary Secondary Secondary priority = 1 priority = 1 84
    85. 85. West Coast DC Central DC East Coast DC Secondary Primary Secondary priority = 1 85
    86. 86. 86
    87. 87. • History of Database Management (http://bit.ly/w3r0dv)• EMC IDC Study (http://bit.ly/y1mJgJ)• Gartner & Big Data (http://bit.ly/xvRP3a)• SQL (http://en.wikipedia.org/wiki/SQL)• Database Management Systems http://en.wikipedia.org/wiki/Dbms)• Dynamo: Amazon’s Highly Available Key-value Store (http://bit.ly/A8F8oy)• CAP Theorem (http://bit.ly/zvA6O6)• NoSQL Google File System and BigTable (http://oreil.ly/wOXliP)• NoSQL Movement whitepaper (http://bit.ly/A8RBuJ)• Sample ERD diagram (http://bit.ly/xV30v) 87
    88. 88. 88
    89. 89. • Impossible for a distributed computer system to simultaneously provide all three of the following guarantees – Consistency - All nodes see the same data at the same time. – Availability - A guarantee that every request receives a response about whether it was successful or failed. – Partition tolerance - No set of failures less than total network failure is allowed to cause the system to respond incorrectly 89
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×