DMDW Extra Lesson - NoSql and MongoDB

STUDIEREN UND DURCHSTARTEN.
Author: Dipl.-Inf. (FH) Johannes Hoppe
Date: 06.05.2011

NoSQL and MongoDB

Trends: Data
- Facebook had 60k servers in 2010
- Google had 450k servers in 2006 (speculated)
- Microsoft: between 100k and 500k servers (since Azure)
- Amazon: likely has a similar number, too (S3)
(Chart: Facebook Server Footprint)

Trends
- Trend 1: increasing data sizes
- Trend 2: more connectedness ("web 2.0")
- Trend 3: more individualization (less structure)

NoSQL: Database paradigms
- Relational (RDBMS)
- NoSQL
  - Key-value stores
  - Document databases
  - Wide column stores (BigTable and clones)
  - Graph databases
- Other

NoSQL: Some NoSQL use cases
1. Massive data volumes
   - Massively distributed architecture required to store the data
   - Google, Amazon, Yahoo, Facebook…
2. Extreme query workload
   - Impossible to efficiently do joins at that scale with an RDBMS
3. Schema evolution
   - Schema flexibility (migration) is not trivial at large scale
   - Schema changes can be gradually introduced with NoSQL

NoSQL - CAP theorem
Requirements for distributed systems:
- Consistency
- Availability
- Partition tolerance

NoSQL - CAP theorem: Consistency
- The system is in a consistent state after an operation
- All clients see the same data
- Strong consistency (ACID) vs. eventual consistency (BASE)
  - ACID: Atomicity, Consistency, Isolation and Durability
  - BASE: Basically Available, Soft state, Eventually consistent

NoSQL - CAP theorem: Availability
- The system is "always on", no downtime
- Node failure tolerance: all clients can find some available replica
- Software/hardware upgrade tolerance

NoSQL - CAP theorem: Partition tolerance
- The system continues to function even when split into disconnected subsets (by a network disruption)
- Not only for reads, but for writes as well!

NoSQL: CAP Theorem
E. Brewer, N. Lynch
You can satisfy at most 2 out of the 3 requirements

NoSQL: CAP Theorem → CA
- Single site clusters (easier to ensure all nodes are always in contact)
- When a partition occurs, the system blocks
- e.g. usable for two-phase commits (2PC), which already require/use blocks

Obviously, any horizontal scaling strategy is based on data partitioning; therefore, designers are forced to decide between consistency and availability.

NoSQL: CAP Theorem → CP
- Some data may be inaccessible (availability sacrificed), but the rest is still consistent/accurate
- e.g. sharded database

NoSQL: CAP Theorem → AP
- System is still available under partitioning, but some of the data returned may be inaccurate
- Need some conflict resolution strategy
- e.g. Master/Slave replication
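To make "conflict resolution strategy" concrete, here is a minimal sketch of one simple (and lossy) option, last-write-wins. The replica objects, the timestamp field, and the resolve function are hypothetical and not taken from any particular database.

```javascript
// Hypothetical state of two replicas that accepted different writes
// for the same key while they were partitioned from each other.
const replicaA = { key: "customer_22", value: { Name: "Norbert" }, timestamp: 100 };
const replicaB = { key: "customer_22", value: { Name: "Norbert H." }, timestamp: 105 };

// Last-write-wins: keep the version with the newer timestamp.
// The older write is silently discarded, which is why this is "lossy".
function resolve(a, b) {
  return a.timestamp >= b.timestamp ? a : b;
}

const winner = resolve(replicaA, replicaB);
console.log(winner.value.Name);
```

Other strategies (for example, keeping both versions and letting the client merge them) trade this simplicity for less data loss.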

NoSQL: RDBMS
- Guarantee ACID by CA (two-phase commits)
- SQL
- Mature

NoSQL: NoSQL DBMS
- No relational tables
- No fixed table schemas
- No joins
- No risk, no fun!
- CP and AP (and sometimes even AP on top of CP → MongoDB*)
* This is damn cool!

NoSQL: Key-value
- One key → one value, very fast
- Key: Hash (no duplicates)
- Value: binary object ("BLOB") (DB does not understand your content)
- Players: Amazon Dynamo, Memcached…

NoSQL: key / value
key:   customer_22
value: ?=PQ)“§VN? =§(Q$U%V§W=(BN W§(=BU&W§$()= W§$(=%
GIVE ME A MEANING!
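The point of this slide can be sketched in a few lines of plain JavaScript: a key-value store only maps keys to opaque bytes, so the only possible operation is lookup by key. The store, put and get names below are hypothetical and only illustrate the idea; this is not the API of Dynamo or Memcached.

```javascript
// Hypothetical in-memory key-value store (illustrative names only).
const store = new Map();

// The store accepts any bytes and never inspects them.
function put(key, bytes) {
  store.set(key, Buffer.from(bytes));
}

// Lookup by key is the only supported operation.
function get(key) {
  return store.get(key); // undefined if the key is unknown
}

put("customer_22", JSON.stringify({ Name: "Norbert", Invoiced: 2222 }));

// Only the client knows how to give the bytes a meaning again.
const raw = get("customer_22");
const customer = JSON.parse(raw.toString("utf8"));
console.log(customer.Name);
```

To find all customers with a given Name, the client would have to fetch and decode every value itself; the store cannot help.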

NoSQL: Document databases
- Key-value store, too
- Value is "understood" by the DB
- Querying the data is possible (not just retrieving the key's content)
- Players: Amazon SimpleDB, CouchDB, MongoDB…

NoSQL: key / value
key:   customer_22
value: { Type: "Customer", Name: "Norbert", Invoiced: 2222 }

NoSQL: key / value (documents)
key:   customer_22
value: {
         Type: "Customer",
         Name: "Norbert",
         Invoiced: 2222,
         Messages: [
           { Title: "Hello", Text: "World" },
           { Title: "Second", Text: "message" }
         ]
       }
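Because a document database understands the structure of its values, it can filter on fields, including fields of nested documents. The following is a minimal sketch in plain JavaScript; the customers array, the _key field, the second customer and the findByMessageTitle helper are made up for illustration.

```javascript
// Hypothetical in-memory "collection"; field names follow the slide.
const customers = [
  {
    _key: "customer_22", Type: "Customer", Name: "Norbert", Invoiced: 2222,
    Messages: [
      { Title: "Hello", Text: "World" },
      { Title: "Second", Text: "message" }
    ]
  },
  { _key: "customer_23", Type: "Customer", Name: "Hans", Invoiced: 50000, Messages: [] }
];

// Filter on a field of the nested Messages documents.
function findByMessageTitle(docs, title) {
  return docs.filter(doc => doc.Messages.some(m => m.Title === title));
}

const hits = findByMessageTitle(customers, "Second");
console.log(hits.map(d => d._key));
```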

NoSQL: (Wide) column stores
- Often referred to as "BigTable clones"
- Each key is associated with many attributes (columns)
- NoSQL column stores are actually hybrid row/column stores
- Different from "pure" relational column stores!
- Players: Google BigTable, Cassandra (Facebook), HBase…

NoSQL
Won't be stored as:      It will be stored as:
22;Norbert;22222         22;23;24
23;Hans;50000            Norbert;Hans;Franz
24;Franz;44000           22222;50000;44000
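The difference between the two layouts is a transposition of the same data. A small sketch in plain JavaScript, using the three records from the slide (the variable names are illustrative only):

```javascript
// The three records from the slide: id, name, amount.
const rows = [
  [22, "Norbert", 22222],
  [23, "Hans", 50000],
  [24, "Franz", 44000]
];

// Row layout: the values of one record are stored next to each other.
const rowLayout = rows.map(r => r.join(";"));

// Column layout: transpose, so the values of one attribute are stored
// next to each other.
const columns = rows[0].map((_, c) => rows.map(r => r[c]));
const columnLayout = columns.map(col => col.join(";"));

console.log(rowLayout);
console.log(columnLayout);
```

Reading one attribute for all records touches a single contiguous run in the column layout, while reading one full record is cheaper in the row layout.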

NoSQL: Graph databases
- Multi-relational graphs
- SPARQL query language (W3C Recommendation!)
- Players: Neo4j, InfoGrid…
(Note: graph DBs are special and somehow the "black sheep" in the NoSQL world – the following PROs/CONs don't apply very well.)

NoSQL: PROs (& Promises)
- Schema-free / semi-structured data
- Massive data stores
- Scaling is easy
- Very, very high availability
- Often simpler to implement (and OR mappers aren't required)
- "Web 2.0 ready"

NoSQL: CONs
- NoSQL implementations are often "alpha", no standards
- Data consistency, no transactions
- Insufficient access control
- SQL: strong for dynamic, cross-table queries (JOIN)
- Relationships aren't enforced (conventions over constraints, except for graph DBs, of course)
- Premature optimization: scalability (don't build for scalability if you never need it!)

NoSQL: Let's rock! MongoDB
Quick Reference Cards: http://www.10gen.com/reference

Basic Deployment
- Create the default data directory in c:\data\db
- Start mongod.exe
- Optionally:
    mongod.exe --dbpath c:\data\db --port 27017 --logpath c:\data\mongodb.log
- Start the shell: mongo.exe

Data Import
    cd c:\dba-training-data\data
    mongoimport -d twitter -c tweets twitter.json

    cd c:\dba-training-data\data\dump\training
    mongorestore -d training -c scores scores.bson

    cd c:\dba-training-data\data\dump
    mongorestore -d digg digg

MongoDB Documents (in the shell)
    use digg
    db.stories.findOne();

JSON → BSON
All JSON documents are stored in a binary format called BSON. BSON supports a richer set of types than JSON.
http://bsonspec.org
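One way to see why a richer type set matters: plain JSON has no date type, so a Date does not survive a JSON round trip, whereas BSON defines a dedicated datetime type (as well as binary data, among others). The sketch below only demonstrates the JSON side in plain JavaScript; it does not use a BSON library, and the doc object is made up.

```javascript
// A document with a real Date value.
const doc = { name: "Smith", created: new Date(0) };

// Round trip through JSON text: the Date comes back as a string.
const roundTripped = JSON.parse(JSON.stringify(doc));

console.log(doc.created instanceof Date);          // a Date before the round trip
console.log(typeof roundTripped.created);          // a plain string afterwards
```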

CRUD – Create (in the shell)
    db.people.save({name: 'Smith', age: 30});

See how the save command works:
    db.foo.save

CRUD – Create
How training.scores was created:
    for (i = 0; i < 1000; i++) {
      ['quiz', 'essay', 'exam'].forEach(function(name) {
        var score = Math.floor(Math.random() * 50) + 50;
        db.scores.save({student: i, name: name, score: score});
      });
    }
    db.scores.count();

CRUD – Read
Queries are specified using a document-style syntax!
    use training
    db.scores.find({score: 50});
    db.scores.find({score: {"$gte": 70}});
The result is a cursor!
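To illustrate what "document-style syntax" means, here is a minimal sketch of how such a query document could be evaluated against a document. The matches helper and the operators table are hypothetical and support only equality, $gte and $lt; this is not MongoDB's implementation, and the sample scores are made up.

```javascript
// Hypothetical evaluator for a tiny subset of query operators.
const operators = {
  $gte: (value, bound) => value >= bound,
  $lt: (value, bound) => value < bound
};

// A query is itself a document: each field maps either to a plain value
// (equality) or to a document of operators.
function matches(doc, query) {
  return Object.keys(query).every(field => {
    const condition = query[field];
    if (condition !== null && typeof condition === "object") {
      return Object.keys(condition).every(op => {
        if (!operators[op]) throw new Error("unsupported operator: " + op);
        return operators[op](doc[field], condition[op]);
      });
    }
    return doc[field] === condition;
  });
}

// Made-up sample documents shaped like training.scores.
const scores = [
  { student: 0, name: "quiz", score: 50 },
  { student: 1, name: "exam", score: 70 },
  { student: 2, name: "essay", score: 95 }
];

const exact = scores.filter(d => matches(d, { score: 50 }));
const atLeast70 = scores.filter(d => matches(d, { score: { $gte: 70 } }));
console.log(exact.length, atLeast70.length);
```

Several conditions in one query document are combined with a logical AND, for example { name: "exam", score: { $gte: 70 } }.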

Exercises
- Find all scores less than 65.
- Find the lowest quiz score.
- Find the highest quiz score.
- Write a query to find all digg stories where the view count is greater than 1000.
- Query for all digg stories whose media type is either 'news' or 'images' and where the topic name is 'Comedy'. (For extra practice, construct two queries using different sets of operators to do this.)
- Find all digg stories where the topic name is 'Television' or the media type is 'videos'. Skip the first 5 results, and limit the result set to 10.

CRUD – Update
    use digg
    db.people.update({name: 'Smith'}, {'$set': {interests: []}});
    db.people.update({name: 'Smith'}, {'$push': {interests: 'chess'}});
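The effect of the two update operators can be sketched on a plain object. The applyUpdate helper below is hypothetical and covers only $set and $push; it is not MongoDB's implementation.

```javascript
// Hypothetical helper: applies $set and $push to a document in place.
function applyUpdate(doc, update) {
  // $set: assign the given value to each listed field.
  for (const [field, value] of Object.entries(update.$set || {})) {
    doc[field] = value;
  }
  // $push: append the given value as ONE element of the array field.
  for (const [field, value] of Object.entries(update.$push || {})) {
    if (doc[field] === undefined) doc[field] = [];
    if (!Array.isArray(doc[field])) throw new Error(field + " is not an array");
    doc[field].push(value);
  }
  return doc;
}

const smith = { name: "Smith", age: 30 };
applyUpdate(smith, { $set: { interests: [] } });
applyUpdate(smith, { $push: { interests: "chess" } });
console.log(smith.interests);
```

Note that $push appends its argument as a single element, so pushing an array such as ['chess'] would produce a nested array ([['chess']]) rather than adding the string.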

Exercises
- Set the proper 'grade' attribute for all scores. For example, users with scores greater than 90 get an 'A'. Set the grade to 'B' for scores falling between 80 and 90.
- You're being nice, so you decide to add 10 points to every score on every "final" exam whose score is lower than 60. How do you do this update?

CRUD – Delete
    db.dropDatabase();
    db.foo.drop();
    db.foo.remove();

"MapReduce is the Uzi of aggregation tools. Everything described with count, distinct and group can be done with MapReduce, and more."
Kristina Chodorow, Michael Dirolf in MongoDB – The Definitive Guide

MapReduce
To use map-reduce, you first write a map function.
    map = function() {
      emit(this.user.name, {diggs: this.diggs, posts: 0});
    }

MapReduce
The reduce function then aggregates those docs by key.
    reduce = function(key, values) {
      var diggs = 0;
      var posts = 0;
      values.forEach(function(doc) {
        diggs += doc.diggs;
        posts += 1;
      });
      return {diggs: diggs, posts: posts};
    }

MapReduce
Now both are used to perform custom aggregation.
    db.stories.mapReduce(map, reduce, {out: 'digg_users'});
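The whole pipeline can be simulated in plain JavaScript without a database. This is only a sketch under stated assumptions: the stories array is made-up sample data, the mapReduce function is a hypothetical stand-in for db.stories.mapReduce, and emit is passed to map as a parameter instead of being a global as in the shell. It also uses a variant of the functions above: MongoDB may call reduce again on partial results and does not call it at all for keys with a single value, so here map emits posts: 1 and reduce sums the emitted posts counts.

```javascript
// Made-up sample data shaped like the digg stories.
const stories = [
  { user: { name: "alice" }, diggs: 10 },
  { user: { name: "bob" }, diggs: 3 },
  { user: { name: "alice" }, diggs: 5 }
];

// Variant of the slide's map: each story counts as one post.
function map(emit) {
  emit(this.user.name, { diggs: this.diggs, posts: 1 });
}

// Variant of the slide's reduce: sums both fields, so the output has the
// same shape as the input and can safely be reduced again.
function reduce(key, values) {
  let diggs = 0;
  let posts = 0;
  values.forEach(doc => {
    diggs += doc.diggs;
    posts += doc.posts;
  });
  return { diggs, posts };
}

// Hypothetical stand-in for mapReduce: group emitted values by key,
// then reduce each group. Keys with one value skip reduce.
function mapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  docs.forEach(doc => mapFn.call(doc, emit));
  const out = {};
  for (const [key, values] of groups) {
    out[key] = values.length === 1 ? values[0] : reduceFn(key, values);
  }
  return out;
}

const diggUsers = mapReduce(stories, map, reduce);
console.log(diggUsers);
```

With posts: 0 in map and posts += 1 in reduce, a user with exactly one story would end up with posts: 0 in this simulation, because reduce never runs for that key.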
