3. Relational DBs
● Born in the 1970s
– storage was expensive
– schemas were simple
● Based on Relational Model
– Mathematical model for describing data structure
– Data represented as "tuples", grouped into "relations"
● Queries based on Relational Algebra
– union, intersection, difference, cartesian product, selection,
projection, join, division
● Constraints
– Foreign Keys, Primary Keys, Indexes
– Domain Integrity (DataTypes)
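The relational-algebra operators listed above can be pictured in a few lines of code. A minimal sketch (assuming relations are plain arrays of objects standing in for tuples; all names here are illustrative, not from any library):

```javascript
// selection: keep tuples matching a predicate
const select = (relation, pred) => relation.filter(pred);

// projection: keep only the named attributes (duplicates removed, as in set semantics)
const project = (relation, attrs) => {
  const seen = new Set();
  const out = [];
  for (const tuple of relation) {
    const projected = {};
    for (const a of attrs) projected[a] = tuple[a];
    const key = JSON.stringify(projected);
    if (!seen.has(key)) { seen.add(key); out.push(projected); }
  }
  return out;
};

// natural join on one shared attribute
const join = (r, s, attr) =>
  r.flatMap(t => s.filter(u => u[attr] === t[attr]).map(u => ({ ...t, ...u })));

const users  = [{ id: 1, name: "ann" }, { id: 2, name: "bob" }];
const orders = [{ id: 1, total: 10 }, { id: 1, total: 20 }];

console.log(select(users, u => u.id === 1)); // [{ id: 1, name: "ann" }]
console.log(project(orders, ["id"]));        // [{ id: 1 }]
console.log(join(users, orders, "id").length); // 2
```

Union, intersection, and difference follow the same pattern over the tuple arrays.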
8. Relational DBs - Transactions
● Atomicity
– If one part of the transaction fails, the whole transaction fails
● Consistency
– Transaction leaves the DB in a valid state
● Isolation
– One transaction doesn't see an intermediate state of the other
● Durability
– Transaction gets persisted
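Atomicity in particular can be pictured with a small sketch (plain JavaScript, not database code; the copy-then-commit scheme is one illustrative strategy, not how any specific RDBMS implements it):

```javascript
// All operations run against an isolated copy; only a fully successful
// transaction replaces the real state, so a failure changes nothing.
function runTransaction(state, operations) {
  const draft = { ...state };            // work on a copy (isolation of the draft)
  try {
    for (const op of operations) op(draft);
    return { ok: true, state: draft };   // commit: publish the new state
  } catch (e) {
    return { ok: false, state };         // rollback: original state untouched
  }
}

const account = { balance: 100 };

// debit more than the balance -> the whole transaction fails
const result = runTransaction(account, [
  (s) => { s.balance -= 150; },
  (s) => { if (s.balance < 0) throw new Error("insufficient funds"); },
]);

console.log(result.ok, result.state.balance); // false 100
```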
10. NoSQL – Why?
● Web 2.0
– Huge data volumes
– Need for speed
– Accessibility
● RDBMS are difficult to scale
● Storage is getting cheap
● Commodity machines are getting cheap
11. NoSQL – What?
● Simple storage of data
● Looser consistency model (eventual consistency), in
order to achieve:
– higher availability
– horizontal scaling
● No JOINs
● Optimized for big data, when no relational features are
needed
14. Eventual Consistency
● RDBMS: all users see a consistent view
of the data
● ACID gets difficult when distributing
data across nodes
● Eventual Consistency: inconsistencies
are transitory. The DB may be inconsistent
at a given point in time, but will
eventually become consistent.
● BASE (in contrast to ACID) – Basically
Available, Soft state, Eventually consistent
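A toy sketch of eventual consistency (plain JavaScript, not a real database; the two-replica setup and names are made up): a write is acknowledged by one replica immediately and reaches the other only when the replication log is drained, so reads are briefly inconsistent but eventually converge.

```javascript
const replicaA = new Map();
const replicaB = new Map();
const replicationLog = [];

function write(key, value) {
  replicaA.set(key, value);          // acknowledged right away on A
  replicationLog.push([key, value]); // shipped to B later, asynchronously
}

function replicate() {
  while (replicationLog.length > 0) {
    const [key, value] = replicationLog.shift();
    replicaB.set(key, value);
  }
}

write("user:1", "ruben");
console.log(replicaA.get("user:1"), replicaB.get("user:1")); // "ruben" undefined (stale read on B)
replicate();
console.log(replicaB.get("user:1")); // "ruben" (converged)
```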
15. CAP Theorem
● Consistency: all nodes see the same data
at the same time
● Availability: requests always get an
immediate response
● Partition tolerance: the system continues to
work, even if a part of it breaks
● A distributed system can guarantee at most
two of these three properties at once
16. NoSQL - History
● Term first used in 1998 by C. Strozzi to name
his relational DB that didn't use SQL
● Term reused in 2009 by E. Evans to name the
distributed DBs that didn't provide ACID
● Some people read it as "Not Only SQL"
● Should actually be called "NoRel" (not
relational)
17. NoSQL – Some Features
● Auto-Sharding
● Replication
● Caching
● Dynamic Schema
18. NoSQL - Types
● Document
– "Map" of key-value pairs, with a "Document" (XML, JSON, PDF, ..) as
value
– MongoDB, CouchDB
● Key-Value
– "Map" of key-value pairs, with an "Object" (Integer, String, Order, ..)
as value
– Cassandra, Dynamo, Voldemort
● Graph
– Data stored in a graph structure – nodes have pointers to
adjacent ones
– Neo4J
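The difference between the first two types can be sketched in a few lines (plain JavaScript, illustrative only; the stores and data are made up): both are a key → value map, but a document store understands the structure of its values, so it can filter on fields inside them.

```javascript
const keyValueStore = new Map();   // value is opaque to the store
keyValueStore.set("session:42", "dXNlcj1ydWJlbg==");

const documentStore = new Map();   // value is a structured document
documentStore.set("u1", { name: "ruben", age: 33 });
documentStore.set("u2", { name: "eva",   age: 28 });

// a "query" over document contents, impossible for opaque values
const age33 = [...documentStore.values()].filter(d => d.age === 33);
console.log(age33.length); // 1
```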
19. MongoDB
● Open-source NoSQL document DB written in
C++
● Started in 2009
● Commercial Support by 10gen
● Name comes from "humongous" (huge)
● http://www.mongodb.org/
20. MongoDB – Document Oriented
● No fixed document structure – schemaless
● Atomicity: only at document level (no
transactions across documents)
● Normalization is not easy to achieve; two modeling options:
– Embed: more duplication, better read performance
– Reference: less duplication, more round trips
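A sketch of the two modeling options, using plain objects in place of MongoDB documents (field names and ids are made up for illustration):

```javascript
// Embedding: the address lives inside the user document.
// One read fetches everything, but shared addresses get duplicated.
const embeddedUser = {
  _id: "u1",
  name: "ruben",
  address: { street: "Musterstraße", zip: "5026" },
};

// Referencing: the user stores only an address id.
// No duplication, but reading the address costs an extra lookup (round trip).
const addresses = new Map([["a1", { street: "Musterstraße", zip: "5026" }]]);
const referencingUser = { _id: "u1", name: "ruben", addressId: "a1" };

console.log(embeddedUser.address.zip);                     // "5026" (one read)
console.log(addresses.get(referencingUser.addressId).zip); // "5026" (two reads)
```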
26. SQL → Mongo Mapping (I)
SQL Statement → Mongo Query Language
– CREATE TABLE USERS (a Number, b Number) → implicit
– INSERT INTO USERS VALUES(1,1) → db.users.insert({a:1,b:1})
– SELECT a,b FROM users → db.users.find({}, {a:1,b:1})
– SELECT * FROM users → db.users.find()
– SELECT * FROM users WHERE age=33 → db.users.find({age:33})
– SELECT * FROM users WHERE age=33 ORDER BY name → db.users.find({age:33}).sort({name:1})
27. SQL → Mongo Mapping (II)
SQL Statement → Mongo Query Language
– SELECT * FROM users WHERE age>33 → db.users.find({age:{$gt:33}})
– CREATE INDEX myindexname ON users(name) → db.users.ensureIndex({name:1})
– SELECT * FROM users WHERE a=1 and b='q' → db.users.find({a:1,b:'q'})
– SELECT * FROM users LIMIT 10 SKIP 20 → db.users.find().limit(10).skip(20)
– SELECT * FROM users LIMIT 1 → db.users.findOne()
– EXPLAIN PLAN FOR SELECT * FROM users WHERE z=3 → db.users.find({z:3}).explain()
– SELECT DISTINCT last_name FROM users → db.users.distinct('last_name')
– SELECT COUNT(*) FROM users WHERE age > 30 → db.users.find({age:{$gt:30}}).count()
31. MongoDB – Replication (I)
● Master-slave replication: primary and secondary nodes
● replica set: cluster of mongod instances that replicate amongst one
another and ensure automated failover
32. MongoDB – Replication (II)
● adds redundancy
● helps to ensure high availability – automatic
failover
● simplifies backups
33. WriteConcerns
● Errors Ignored
– even network errors are ignored
● Unacknowledged
– at least network errors are handled
● Acknowledged
– the server confirms the write and reports errors such as duplicate keys (default)
● Journaled
– persisted to journal log
● Replica ACK
– 1..n
– Or 'majority'
34. MongoDB – Sharding (I)
● Scale Out
● Distributes data to nodes automatically
● Balances data and load across machines
35. MongoDB – Sharding (II)
● A sharded cluster is composed of:
– Shards: hold the data
● Either a single mongod instance (the primary daemon process –
handles data requests) or a replica set
– Config servers:
● mongod instances holding the cluster metadata
– mongos instances:
● route application calls to the shards
● No single point of failure
38. MongoDB – Sharding (V)
● A collection has a shard key: existing field(s)
present in all documents
● Documents get distributed according to shard-key ranges
● Within a shard, documents are partitioned into
chunks
● Mongo tries to keep all chunks at roughly the same size
39. MongoDB – Sharding (VI)
● Shard Balancing
– When a shard has too many chunks, mongo moves
chunks to other shards
● Only makes sense with huge amounts of data
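Range-based distribution can be sketched as follows (plain JavaScript, not MongoDB internals; the chunk boundaries and shard names are made up): each chunk owns a half-open range of shard-key values, and a document is routed to the shard owning the chunk its key falls into.

```javascript
// chunk boundaries over a string shard key ("" and "\uffff" act as -inf/+inf)
const chunks = [
  { min: "",  max: "m",      shard: "shard0" },  // keys below "m"
  { min: "m", max: "\uffff", shard: "shard1" },  // keys "m" and above
];

// route a document to the shard owning the chunk its key falls into
function routeToShard(shardKey) {
  const chunk = chunks.find(c => shardKey >= c.min && shardKey < c.max);
  return chunk.shard;
}

console.log(routeToShard("anna"));  // "shard0"
console.log(routeToShard("ruben")); // "shard1"
```

When a chunk grows too big it is split, and the balancer migrates chunks between shards.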
41. Jongo - Example
DB db = new MongoClient().getDB("jongo");
Jongo jongo = new Jongo(db);
MongoCollection users = jongo.getCollection("users");
User user = new User("ruben", "inoto", new Address("Musterstraße", "5026"));
users.save(user);
User ruben = users.findOne("{name: 'ruben'}").as(User.class);
public class User {
    private String name;
    private String surname;
    private Address address;
    // constructor and getters omitted
}
public class Address {
    private String street;
    private String zip;
    // constructor and getters omitted
}
The stored document:
{
"_id" : ObjectId("51b0e1c4d78a1c14a26ada9e"),
"name" : "ruben",
"surname" : "inoto",
"address" : {
"street" : "Musterstraße",
"zip" : "5026"
}
}
42. TTL (TimeToLive)
● Data with an expiry date
● After the specified TimeToLive, the data will be
removed from the DB
● Implemented as an Index
● Useful for logs, sessions, ..
db.broadcastMessages.ensureIndex( { "createdAt": 1 }, { expireAfterSeconds: 3600 } )
43. MapReduce
● Programming model for processing large data sets with a
parallel, distributed algorithm.
● Handles complex aggregation tasks
● Problem can be split into smaller tasks, distributed across
nodes
● map phase: selects the data
– Emits key-value pairs
– Values get grouped by key and passed to the reduce function
● reduce phase: transforms the data
– Accepts two arguments: key and values
– Reduces to a single object all the values associated with the key
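A word count is the classic example. A minimal sketch in plain JavaScript (not MongoDB's mapReduce API; the explicit group-by step plays the role of the shuffle between the two phases):

```javascript
const docs = ["to be or not to be", "be fast"];

// map phase: emit a (key, value) pair per word
const emitted = docs.flatMap(doc => doc.split(" ").map(word => [word, 1]));

// shuffle: group values by key
const grouped = new Map();
for (const [key, value] of emitted) {
  if (!grouped.has(key)) grouped.set(key, []);
  grouped.get(key).push(value);
}

// reduce phase: collapse each key's values into a single result
const counts = {};
for (const [key, values] of grouped) {
  counts[key] = values.reduce((a, b) => a + b, 0);
}

console.log(counts); // { to: 2, be: 3, or: 1, not: 1, fast: 1 }
```

Because each key's values are reduced independently, the per-key work can run in parallel on different nodes.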
53. Conclusion
● Not a silver bullet
● Makes sense when:
– Eventual consistency is acceptable
– Prototyping
– Performance
– Object model doesn't fit well in a relational DB
● Easy to learn