Anti-social Databases

Why the innovation in the database market? This deck talks about why NoSQL databases are important, what they are, and then dives into MongoDB

Published in: Technology
  • Database requirements are changing because of (i) volume, (ii) type of data, (iii) agile development, (iv) new architectures, and (v) new apps.

    1. Open source, high performance database
       Anti-social Databases: NoSQL and MongoDB
       Will LaForest, Senior Director of 10gen Federal
       will@10gen.com, @WLaForest
    2. [Timeline, 1965 to 2010: IBM's IMS; Codd publishes relational model paper in 1970; SQL invented; Oracle founded; PCs gain traction; client/server; 3 tier architecture; WWW born; dynamic web content; web applications; SOA; Brewer's CAP; BigTable; cloud computing; NoSQL movement born; 10gen founded]
    3. [Figure labels: Attribute, Tuple, Relation]
    4. [Entity-relationship diagram: Category, BlogCategory, Author, Blog, BlogTag, Content, Tag, joined by N:1 relationships]
    5. • Data stored in an RDBMS is very compact (disk was more expensive)
       • SQL and RDBMSs made queries flexible with rigid schemas
       • Rigid schemas help optimize joins and storage
       • Massive ecosystem of tools, libraries and integrations
       • Been around 40 years!
    6. • Gartner uses the 3Vs to define big data
       • Volume
         – Big/difficult/extreme volume is relative
       • Variety
         – Changing or evolving data
         – Uncontrolled formats
         – Does not easily adhere to a single schema
         – Unknown at design time
       • Velocity
         – High or volatile inbound data
         – High query and read operations
         – Low latency
    7. VOLUME/VELOCITY & NEW ARCHITECTURES
       • Systems need to scale horizontally, not vertically
       • Commodity servers
       • Cloud computing
       DATA VARIETY & VOLATILITY
       • Extremely difficult to find a single fixed schema
       • Don't know the data schema a priori
       AGILE DEVELOPMENT
       • Iterative & continuous
       • New and emerging apps
    8. • Non-relational databases have been around a long time (heard of MUMPS?)
       • Modern NoSQL theory and offerings started in the early 2000s
       • NoSQL = Not Only SQL
       • A collection of very different products
       • Alternatives to relational databases when those are a bad fit
       • Common motivations
         – Horizontally scalable (commodity server/cloud computing)
         – Schema flexibility
    9. • I group by data model
         – Some people use or include data arrangement
       • Data arrangement is important and cuts across data models (see my talk at BigData DC next week)
         – Column/Document
         – Consistent hashing partitioning
         – Range based partitioning
       • Key/Value
       • BigTable descendants
       • Document oriented
       • Graph
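The two partitioning schemes named on this slide can be contrasted with a small sketch. It assumes a toy FNV-1a string hash and caller-supplied node names; `fnv1a`, `buildRing`, `ringLookup`, and `rangeLookup` are illustrative names, not any product's API.

```javascript
// FNV-1a: a small deterministic 32-bit string hash (chosen for brevity).
function fnv1a(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// Consistent hashing: nodes sit at points on a ring; a key belongs to the
// first node point at or after the key's hash, wrapping around at the end.
function buildRing(nodes, pointsPerNode = 16) {
  const ring = [];
  for (const node of nodes) {
    for (let i = 0; i < pointsPerNode; i++) {
      ring.push({ point: fnv1a(node + "#" + i), node });
    }
  }
  return ring.sort((a, b) => a.point - b.point);
}

function ringLookup(ring, key) {
  const h = fnv1a(key);
  const hit = ring.find((entry) => entry.point >= h);
  return (hit || ring[0]).node;
}

// Range based partitioning: each partition owns a contiguous key range,
// so neighbouring keys land together and range scans stay local.
// bounds: sorted array of { upTo, node }; the last entry has upTo = null.
function rangeLookup(bounds, key) {
  return bounds.find((b) => b.upTo === null || key < b.upTo).node;
}
```

With consistent hashing, adding a node only takes keys away from existing nodes; it never shuffles keys between the old nodes. Range partitioning keeps keys in order, which is what makes range queries on the partition key cheap.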
    10. • Value (data) mapped to a key (think primary key)
        • Some typed, some just BLOBs
        • Fast hash based queries
        • No range queries, no ordering, simple data model
        "Will" → "will@10gen.com"
        "Chris" → "chris@10gen.com"
        Will-obj → [4e61 6d65 3a57 …
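As a minimal sketch of the key/value model, a plain JavaScript `Map` can stand in for a real store. The keys and values come from the slide; the name `store` is illustrative.

```javascript
// Toy key/value store: a Map gives hash based lookup by exact key.
const store = new Map();
store.set("Will", "will@10gen.com");
store.set("Chris", "chris@10gen.com");

// Fast exact-match lookup by key.
const email = store.get("Will");

// There is no index over key ranges, so a "range query" degenerates
// into scanning every key in the store.
const keysBeforeM = [...store.keys()].filter((k) => k < "M");

console.log(email, keysBeforeM);
```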
    11. • Data stored on disk in a column oriented fashion
        • Predominantly hash based indexing
        • Rudimentary or no secondary indexes
        • Range queries and ordering on one dimension (row key)
        • Some use consistent hashing, some use range based row keys
        row Will
          column family "contact": "twitter": "wlaforest", "email": "will@10gen.com", "phone": "555-555-5555"
          column family "personal": "bio": "Will attended … ", "picture": …
        row Chris
          column family "contact": "email": "chris@10gen.com", "phone": "555-555-5555"
          column family "personal": "bio": " … ", "picture": …, "hobby": "golf"
    12. • Data modeled as documents or objects (XML and JSON)
        • No fixed schema
        • Richest data model
        • Consistent hashing and range based partitioning
        {name: "will", eyes: "blue", birthplace: "NY", aliases: ["bill", "la ciacco"], gender: "???", boss: "ben"}
        {name: "jeff", eyes: "blue", height: 72, boss: "ben"}
        {name: "brendan", aliases: ["el diablo"]}
        {name: "matt", pizza: "DiGiorno", height: 72, boss: 555.555.1212}
        {name: "ben", hat: "yes"}
    13. • Not a database!
        • Map Reduce on HDFS or other data source
        • Great for grinding through data
        • When you want to use it
          – Can't use an index
          – Distributing custom algorithms
          – ETL
        • Many NoSQL offerings have native map reduce functionality
    15. • 2007 founded
        • 2009 first release of MongoDB
        • MongoDB is open source
        • 10gen has a Red Hat-like business model
          – Subscriptions (subscriber build, support)
          – Training
          – Consulting
        • ~80M in funding
          – Sequoia, NEA, In-Q-Tel, Union Square Ventures, Flybridge Capital
    16. • #2 on Indeed's Fastest Growing Jobs
        • Jaspersoft BigData Index: "Demand for MongoDB, the document-oriented NoSQL database, saw the biggest spike with over 200% growth in 2011."
        • 451 Group
        • Google Searches
        • "MongoDB increasing its dominance"
    17. #2 ON INDEED'S FASTEST GROWING JOBS
    18. "MongoDB INCREASING ITS DOMINANCE"
    19. • Scale horizontally over commodity hardware
        • Agility essential (schema free/heterogeneous interface)
        • RDBMSs are great, so keep what works
          – Rich data models
          – Ad hoc queries
          – Fully featured indexes
        • What doesn't distribute well?
          – Long running multi-row transactions
          – Joins
          – Both are artifacts of the relational data model
    20. • Data stored as documents (JSON)
        • Schema free
        • CRUD operations (Create, Read, Update, Delete)
        • Atomic document operations
        • Consistent (but tunable… advanced topic)
        • Rich indexing (secondary, geospatial, covered)
        • Ad hoc queries like SQL
          – Equality
          – Ranges
          – Regular expression searches
          – Geospatial
        • Replication
          – HA, read scalability, geo centric reads
        • Sharding (sometimes called partitioning) for scalability
    21. RDBMS       MongoDB
        Database    Database
        Table       Collection
        Row         Document
    23. var p = { author: "roger",
                  date: new Date(),
                  text: "Spirited Away",
                  tags: ["Tezuka", "Manga"] }
        > db.posts.save(p)
    24. > db.posts.find()
        { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
          author : "roger",
          date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
          text : "Spirited Away",
          tags : [ "Tezuka", "Manga" ] }
        Notes:
        - _id is unique, but can be anything you'd like
    25. Create an index on any field in a document
        // 1 means ascending, -1 means descending
        > db.posts.ensureIndex({author: 1})
        > db.posts.find({author: 'roger'})
        { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"), author : "roger", ... }
    26. • Conditional operators
          – $all, $exists, $mod, $ne, $in, $nin, $nor, $or, $size, $type
          – $lt, $lte, $gt, $gte
        // find posts with any tags
        > db.posts.find( { tags: { $exists: true } } )
        // find posts matching a regular expression
        > db.posts.find( { author: /^rog*/i } )
        // count posts by an author before a certain date
        > db.posts.find( { author: 'roger', date: { $lt: Sat… } } ).count()
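To make the semantics of a few of these operators concrete, here is a toy in-memory matcher. It is not how MongoDB evaluates queries; `matches` and `operators` are hypothetical names, only a handful of operators are covered, and array-element matching is not modeled.

```javascript
// Toy implementations of a few conditional operators.
const operators = {
  $lt: (value, arg) => value < arg,
  $lte: (value, arg) => value <= arg,
  $gt: (value, arg) => value > arg,
  $gte: (value, arg) => value >= arg,
  $ne: (value, arg) => value !== arg,
  $in: (value, arg) => arg.includes(value),
  $exists: (value, arg) => (value !== undefined) === arg,
};

// True when every field condition in the query holds for the document.
function matches(doc, query) {
  return Object.entries(query).every(([field, condition]) => {
    const value = doc[field];
    if (condition instanceof RegExp) return condition.test(value);
    const keys =
      condition !== null && typeof condition === "object"
        ? Object.keys(condition)
        : [];
    const isOperatorObject =
      keys.length > 0 && keys.every((k) => k.startsWith("$"));
    if (!isOperatorObject) return value === condition; // plain equality
    return keys.every((op) => operators[op](value, condition[op]));
  });
}

const samplePosts = [
  { author: "roger", date: new Date(2010, 6, 24), tags: ["Tezuka", "Manga"] },
  { author: "fred", date: new Date(2011, 0, 1) },
];
const beforeCutoff = samplePosts.filter((p) =>
  matches(p, { author: "roger", date: { $lt: new Date(2011, 0, 1) } })
);
```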
    27. • $set, $unset, $inc, $push, $pushAll, $pull, $pullAll, $bit
        > comment = { author: "Fred", date: new Date(), text: "Best Movie Ever" }
        > db.posts.update( { _id: "..." }, { $push: { comments: comment } } );
    28. { _id : ObjectId("4c4ba5c0672c685e5e8aabf3"),
          author : "roger",
          date : "Sat Jul 24 2010 19:47:11 GMT-0700 (PDT)",
          text : "Spirited Away",
          tags : [ "Tezuka", "Manga" ],
          comments : [
            { author : "Fred",
              date : "Sat Jul 24 2010 20:51:03 GMT-0700 (PDT)",
              text : "Best Movie Ever" } ] }
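The effect of update operators on a document can be sketched in plain JavaScript. This is a toy covering four operators, not MongoDB's implementation; `updaters`, `applyUpdate`, and the `commentCount` field are hypothetical.

```javascript
// Toy versions of a few update operators; each mutates the document in place.
const updaters = {
  $set: (doc, field, arg) => { doc[field] = arg; },
  $unset: (doc, field) => { delete doc[field]; },
  $inc: (doc, field, arg) => { doc[field] = (doc[field] || 0) + arg; },
  $push: (doc, field, arg) => { (doc[field] = doc[field] || []).push(arg); },
};

// Apply every operator in the update document to the target document.
function applyUpdate(doc, update) {
  for (const [op, fields] of Object.entries(update)) {
    for (const [field, arg] of Object.entries(fields)) {
      updaters[op](doc, field, arg);
    }
  }
  return doc;
}

const post = { author: "roger", text: "Spirited Away", tags: ["Tezuka", "Manga"] };
applyUpdate(post, {
  $push: { comments: { author: "Fred", text: "Best Movie Ever" } },
  $inc: { commentCount: 1 },
});
```

Because the whole change lands on one document, it lines up with the deck's point that operations are atomic at the document level.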
    29. // Index nested documents
        > db.posts.ensureIndex( { "comments.author": 1 } )
        > db.posts.find( { "comments.author": "Fred" } )
        // Index on tags
        > db.posts.ensureIndex( { tags: 1 } )
        > db.posts.find( { tags: "Manga" } )
        // Geospatial index
        > db.posts.ensureIndex( { "author.location": "2d" } )
        > db.posts.find( { "author.location": { $near: [22, 42] } } )
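The idea behind indexing an array field such as `tags` can be sketched with an in-memory map from each tag to the documents containing it. This is only an illustration of the concept, not MongoDB's index structure; `buildIndex` is a hypothetical helper.

```javascript
// Toy index over an array field: map each element to the ids of the
// documents containing it, so a lookup avoids scanning every document.
function buildIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const values = Array.isArray(doc[field]) ? doc[field] : [doc[field]];
    for (const v of values) {
      if (!index.has(v)) index.set(v, []);
      index.get(v).push(doc._id);
    }
  }
  return index;
}

const indexedPosts = [
  { _id: 1, author: "roger", tags: ["Tezuka", "Manga"] },
  { _id: 2, author: "fred", tags: ["Manga"] },
];
const tagIndex = buildIndex(indexedPosts, "tags");
const mangaIds = tagIndex.get("Manga");
```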
    31. • Do them client side (Python or R)
        • Native Map/Reduce in JS in MongoDB
          – Distributes across the cluster with good data locality
        • New aggregation framework
          – Declarative (no JS required)
          – Pipeline approach (like Unix ps -ax | tee processes.txt | more)
        • Hadoop
          – Intersect the indexing of MongoDB with the brute force parallelization of Hadoop
          – Hadoop MongoDB connector
    33. • Prod clusters on the order of
          – 100B objects
          – 50k qps per server
          – ~1400 nodes
        [Diagram labels: Better data locality, In-Memory Caching, Auto-Sharding, Replication/HA, Horizontal Scaling]
    34. • Sharding Details
        • Replica Set Details
        • Consistency Details
        • Common Deployment Scenarios
        • Citations
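    36. [Replica set diagram: Driver, Primary, and two Secondaries; arrows labeled Write, Read, Read]
    37. [Replica set diagram: Driver, Primary, and two Secondaries; arrows labeled Write, Read]
    38. [Replica set diagram: Driver, Primary, and two Secondaries; arrows labeled 1. Write, 1. Replicate, 2. Read, 2. Read]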
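The write, replicate, read sequence in these diagrams can be illustrated with a toy replica set in which replication is an explicit step. This is a sketch under simplifying assumptions (in-memory maps, no elections, no failures); the `ReplicaSet` class and its methods are hypothetical.

```javascript
// Toy replica set: writes go to the primary and are copied to the
// secondaries only when replicate() runs, modeling asynchronous replication.
class ReplicaSet {
  constructor(secondaryCount) {
    this.primary = new Map();
    this.secondaries = Array.from({ length: secondaryCount }, () => new Map());
    this.pending = []; // writes not yet applied to the secondaries
  }
  write(key, value) {
    this.primary.set(key, value);
    this.pending.push([key, value]);
  }
  replicate() {
    for (const [key, value] of this.pending) {
      for (const s of this.secondaries) s.set(key, value);
    }
    this.pending = [];
  }
  read(key, { fromSecondary = false } = {}) {
    return (fromSecondary ? this.secondaries[0] : this.primary).get(key);
  }
}

const rs = new ReplicaSet(2);
rs.write("author", "roger");
const stale = rs.read("author", { fromSecondary: true }); // not replicated yet
rs.replicate();
const fresh = rs.read("author", { fromSecondary: true });
```

A read from a secondary issued before replication completes does not see the write, which is why reading from secondaries trades consistency for read scalability.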
    39. • Fire and forget
        • Wait for error
        • Wait for fsync
        • Wait for journal sync
        • Wait for replication
    40. [Sequence diagram, Driver and Primary: write; apply in memory]
    41. [Sequence diagram, Driver and Primary: write, getLastError; apply in memory]
    42. [Sequence diagram, Driver and Primary: write, getLastError with j:true; apply in memory, write to journal]
    43. [Sequence diagram, Driver and Primary: write, getLastError with fsync:true; apply in memory, fsync]
    44. [Sequence diagram, Driver, Primary, and Secondary: write, getLastError with w:2; apply in memory, replicate]
    45. Value          Meaning
        <n:integer>    Replicate to N members of replica set
        "majority"     Replicate to a majority of replica set members
        <m:modeName>   Use custom error mode name
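The first two rows of this table reduce to a simple acknowledgement count. The sketch below assumes the caller already knows how many members have acknowledged the write; `writeConcernSatisfied` is a hypothetical helper, and custom error modes are not modeled.

```javascript
// Returns true once enough replica set members have acknowledged a write.
// `w` is a member count or the string "majority".
function writeConcernSatisfied(w, acks, memberCount) {
  const needed = w === "majority" ? Math.floor(memberCount / 2) + 1 : w;
  return acks >= needed;
}
```

For a three-member set, `w: 2` and `w: "majority"` both require two acknowledgements; for a five-member set, a majority requires three.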
    47. > db.runCommand( { shardcollection: "test.users", key: { email: 1 } } )
        { name: "Jared", email: "jsr@10gen.com" }
        { name: "Scott", email: "scott@10gen.com" }
        { name: "Dan", email: "dan@10gen.com" }
    48. [Key range from -∞ to +∞]
    49. [Key range from -∞ to +∞ containing dan@10gen.com, scott@10gen.com, jsr@10gen.com]
    50. [Same key range, labeled "Split!"]
    51. [Same key range divided in two; labels: "This is a chunk", "Split!", "This is a chunk"]
    52. [Key range from -∞ to +∞ containing dan@10gen.com, scott@10gen.com, jsr@10gen.com]
    53. [Same key range, labeled "Split!"]
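A chunk split can be sketched as cutting a key range at its median key once the chunk grows too large. The sketch assumes a key-count threshold purely for brevity (real splits are driven by data size); `maybeSplit` and the chunk shape are hypothetical, and `null` stands for -∞ or +∞.

```javascript
// Split a chunk in two at its median key once it holds more than maxKeys keys.
function maybeSplit(chunk, maxKeys) {
  if (chunk.keys.length <= maxKeys) return [chunk];
  const sorted = [...chunk.keys].sort();
  const splitPoint = sorted[Math.floor(sorted.length / 2)];
  return [
    { min: chunk.min, max: splitPoint, keys: sorted.filter((k) => k < splitPoint) },
    { min: splitPoint, max: chunk.max, keys: sorted.filter((k) => k >= splitPoint) },
  ];
}

const whole = {
  min: null, // -∞
  max: null, // +∞
  keys: ["jsr@10gen.com", "scott@10gen.com", "dan@10gen.com"],
};
const parts = maybeSplit(whole, 2);
```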
    54. -∞                 adam@10gen.com     1
        adam@10gen.com     jared@10gen.com    1
        jared@10gen.com    scott@10gen.com    1
        scott@10gen.com    +∞                 1
        • Stored in the config servers
        • Cached in mongos
        • Used to route requests and keep the cluster balanced
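Routing with such a chunk table amounts to finding the range that contains the shard key value. The sketch below reuses the range boundaries from the slide but assigns hypothetical shard names so the routing is visible; `findChunk` is an illustrative helper, and `null` stands for -∞ or +∞.

```javascript
// Chunk table: each chunk covers [min, max) of the shard key.
// Shard names here are hypothetical.
const chunks = [
  { min: null, max: "adam@10gen.com", shard: "shard1" },
  { min: "adam@10gen.com", max: "jared@10gen.com", shard: "shard2" },
  { min: "jared@10gen.com", max: "scott@10gen.com", shard: "shard3" },
  { min: "scott@10gen.com", max: null, shard: "shard4" },
];

// Find the chunk whose range contains the given shard key value.
function findChunk(table, key) {
  return table.find(
    (c) => (c.min === null || key >= c.min) && (c.max === null || key < c.max)
  );
}

const target = findChunk(chunks, "dan@10gen.com").shard;
```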
    55. [Cluster diagram: mongos, balancer, three config servers; label "Chunks!"; chunks 1 to 48 spread evenly, 12 per shard, across Shard 1 to Shard 4]
    56. [Cluster diagram, label "Imbalance": Shard 1 holds chunks 1 to 12; the other shards hold four chunks each (21 to 24, 33 to 36, 45 to 48)]
    57. [Cluster diagram, label "Move chunk 1 to Shard 2"]
    58. [Cluster diagram, same chunk layout, no label]
    59. [Cluster diagram, label "Chunks 1, 2, and 3 have migrated": Shard 1 now holds chunks 4 to 12]
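The balancing behavior in these diagrams can be approximated by a loop that moves one chunk at a time from the fullest shard to the emptiest. This is a toy policy, not the actual balancer; `balance` and its default `threshold` of 2 are assumptions made for illustration. The starting layout follows the "Imbalance" slide.

```javascript
// Toy balancer: repeatedly move one chunk from the fullest shard to the
// emptiest until their chunk counts differ by less than `threshold`.
function balance(shards, threshold = 2) {
  const moves = [];
  for (;;) {
    const byLoad = Object.keys(shards).sort(
      (a, b) => shards[a].length - shards[b].length
    );
    const emptiest = byLoad[0];
    const fullest = byLoad[byLoad.length - 1];
    if (shards[fullest].length - shards[emptiest].length < threshold) break;
    const chunk = shards[fullest].shift();
    shards[emptiest].push(chunk);
    moves.push({ chunk, from: fullest, to: emptiest });
  }
  return moves;
}

const cluster = {
  shard1: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
  shard2: [21, 22, 23, 24],
  shard3: [33, 34, 35, 36],
  shard4: [45, 46, 47, 48],
};
const moves = balance(cluster);
```

Under this policy the first chunks to leave Shard 1 are chunks 1, 2, and 3, and the loop stops once every shard holds six chunks.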
    61. 1. Query arrives at mongos
        2. mongos routes query to a single shard
        3. Shard returns results of query
        4. Results returned to client
    62. 1. Query arrives at mongos
        2. mongos broadcasts query to all shards
        3. Each shard returns results for query
        4. Results combined and returned to client
    63. 1. Query arrives at mongos
        2. mongos broadcasts query to all shards
        3. Each shard locally sorts results
        4. Results returned to mongos
        5. mongos merge sorts individual results
        6. Combined sorted result returned to client
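The merge step of the distributed sort can be sketched as a k-way merge over per-shard result lists that are already sorted. `mergeSorted` is a hypothetical helper using a simple repeated-minimum scan; the state codes are made-up sample data.

```javascript
// Merge already-sorted per-shard result lists into one sorted list.
function mergeSorted(lists, compare) {
  const cursors = lists.map(() => 0);
  const merged = [];
  for (;;) {
    let best = -1;
    for (let i = 0; i < lists.length; i++) {
      if (cursors[i] >= lists[i].length) continue; // this shard is exhausted
      if (
        best === -1 ||
        compare(lists[i][cursors[i]], lists[best][cursors[best]]) < 0
      ) {
        best = i;
      }
    }
    if (best === -1) return merged; // every list is exhausted
    merged.push(lists[best][cursors[best]++]);
  }
}

const byState = (a, b) => (a < b ? -1 : a > b ? 1 : 0);
const shardResults = [["AK", "NY"], ["CA", "TX"], ["FL"]];
const sortedStates = mergeSorted(shardResults, byState);
```

Because each shard sorts locally, the router only has to compare the heads of the per-shard lists rather than re-sort everything.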
    64. Inserts
          Requires shard key: db.users.insert({ name: "Jared", email: "jsr@10gen.com" })
        Removes
          Routed: db.users.remove({ email: "jsr@10gen.com" })
          Scattered: db.users.remove({ name: "Jared" })
        Updates
          Routed: db.users.update( { email: "jsr@10gen.com" }, { $set: { state: "CA" } } )
          Scattered: db.users.update( { state: "FZ" }, { $set: { state: "CA" } } )
    65. By shard key → Routed: db.users.find({ email: "jsr@10gen.com" })
        Sorted by shard key → Routed in order: db.users.find().sort({ email: -1 })
        Find by non shard key → Scatter gather: db.users.find({ state: "CA" })
        Sorted by non shard key → Distributed merge sort: db.users.find().sort({ state: 1 })
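The routed versus scatter gather distinction comes down to whether the query pins the shard key. The sketch below assumes three hypothetical shards and a made-up range assignment in `shardFor`; only an exact match on the shard key is treated as routable.

```javascript
const allShards = ["shard1", "shard2", "shard3"];

// Hypothetical lookup from a shard key value to the shard that owns it.
function shardFor(email) {
  if (email < "h") return "shard1";
  if (email < "q") return "shard2";
  return "shard3";
}

// Return the shards that must see the query: one shard when the query
// pins the shard key to a single value, every shard otherwise.
function targetShards(query, shardKey = "email") {
  const value = query[shardKey];
  return typeof value === "string" ? [shardFor(value)] : allShards;
}

const routed = targetShards({ email: "jsr@10gen.com" });
const scattered = targetShards({ state: "CA" });
```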
    67. [Deployment diagram: one data center with Primary, Secondary, Secondary]
    68. [Deployment diagram: one data center with Primary, Secondary, and a Secondary labeled hidden=true, backups]
    69. [Deployment diagram: Active Data Center and Standby Data Center; Primary, Secondary, Secondary; two members labeled priority = 1]
    70. [Deployment diagram: West Coast DC, Central DC, East Coast DC; Secondary, Primary, Secondary; one member labeled priority = 1]
    72. • History of Database Management (http://bit.ly/w3r0dv)
        • EMC IDC Study (http://bit.ly/y1mJgJ)
        • Gartner & Big Data (http://bit.ly/xvRP3a)
        • SQL (http://en.wikipedia.org/wiki/SQL)
        • Database Management Systems (http://en.wikipedia.org/wiki/Dbms)
        • Dynamo: Amazon's Highly Available Key-value Store (http://bit.ly/A8F8oy)
        • CAP Theorem (http://bit.ly/zvA6O6)
        • NoSQL Google File System and BigTable (http://oreil.ly/wOXliP)
        • NoSQL Movement whitepaper (http://bit.ly/A8RBuJ)
        • Sample ERD diagram (http://bit.ly/xV30v)
    75. Pipeline operators: $project $match $limit $skip $unwind $group $sort
        Sample document:
        { title : "this is my title",
          author : "bob",
          posted : new Date(),
          pageViews : 5,
          tags : [ "fun", "good", "fun" ],
          comments : [
            { author : "joe", text : "this is cool" },
            { author : "sam", text : "this is bad" } ],
          other : { foo : 5 } }
        Pipeline:
        db.article.aggregate(
          { $project : { author : 1, tags : 1 } },
          { $unwind : "$tags" },
          { $group : { _id : "$tags", authors : { $addToSet : "$author" } } }
        );
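What this pipeline computes can be traced with plain array operations on the sample document. The helpers `project`, `unwind`, and `groupAddToSet` are hypothetical stand-ins for the three stages, not the aggregation framework itself, and the handling of `_id` in `$project` is left out.

```javascript
const articles = [
  {
    title: "this is my title",
    author: "bob",
    pageViews: 5,
    tags: ["fun", "good", "fun"],
  },
];

// $project: keep only the listed fields.
const project = (docs, fields) =>
  docs.map((d) => Object.fromEntries(fields.map((f) => [f, d[f]])));

// $unwind: one output document per array element.
const unwind = (docs, field) =>
  docs.flatMap((d) => d[field].map((item) => ({ ...d, [field]: item })));

// $group with $addToSet: collect distinct values per group key.
function groupAddToSet(docs, keyField, valueField) {
  const groups = new Map();
  for (const d of docs) {
    if (!groups.has(d[keyField])) groups.set(d[keyField], new Set());
    groups.get(d[keyField]).add(d[valueField]);
  }
  return [...groups].map(([key, set]) => ({ _id: key, authors: [...set] }));
}

const result = groupAddToSet(
  unwind(project(articles, ["author", "tags"]), "tags"),
  "tags",
  "author"
);
```

Each stage consumes the previous stage's output, which is the sense in which the deck compares the framework to a Unix pipeline. The duplicate "fun" tag collapses into one group, and `$addToSet` keeps "bob" only once.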
    77. [Diagram: Input data → MAP → Intermediate data → REDUCE → Output data, with parts numbered 1, 2, 3]
    78. [Diagram: Input data → MAP → Intermediate data → REDUCE → Output data, with parts numbered 1, 2, 3]
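The map, intermediate data, reduce flow in the diagram can be sketched in a single process. This toy counts tags across posts; `mapPost`, `groupByKey`, `reduceCounts`, and `mapReduce` are hypothetical names, and nothing here is distributed.

```javascript
// Map phase: emit one (key, value) pair per tag.
function mapPost(post) {
  return post.tags.map((tag) => [tag, 1]);
}

// Shuffle: group intermediate values by key.
function groupByKey(pairs) {
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }
  return groups;
}

// Reduce phase: fold each key's values into one result.
function reduceCounts(key, values) {
  return values.reduce((sum, v) => sum + v, 0);
}

function mapReduce(inputs, map, reduce) {
  const intermediate = inputs.flatMap(map);
  const output = {};
  for (const [key, values] of groupByKey(intermediate)) {
    output[key] = reduce(key, values);
  }
  return output;
}

const posts = [
  { author: "roger", tags: ["Tezuka", "Manga"] },
  { author: "fred", tags: ["Manga"] },
];
const tagCounts = mapReduce(posts, mapPost, reduceCounts);
```

In a distributed setting the map calls run next to the data and only the intermediate pairs travel over the network, which is the data locality benefit the deck mentions.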
    79. • Impossible for a distributed computer system to simultaneously provide all three of the following guarantees
          – Consistency: all nodes see the same data at the same time
          – Availability: a guarantee that every request receives a response about whether it was successful or failed
          – Partition tolerance: no set of failures less than total network failure is allowed to cause the system to respond incorrectly
