Overview of NoSQL

An introduction to NoSQL: why it is needed, what the drawbacks of RDBMS are, and how NoSQL can overcome these drawbacks.

  • Tell the story of RDBMS: why RDBMS is popular, what the problems of RDBMS are, and why the features of NoSQL are needed.
  • MongoDB is very good at real-time inserts, updates, and queries. It provides scalability and replication, which are necessary functions for a large website's real-time data store.
  • Scalability is an architectural feature of a system that can continue serving a greater number of requests with little degradation in performance. Vertical scaling (simply adding more hardware capacity and memory to your existing machine) is the easiest way to achieve this. Horizontal scaling means adding more machines that have all or some of the data on them so that no one machine has to bear the entire burden of serving requests. But then the software itself must have an internal mechanism for keeping its data in sync with the other nodes in the cluster.

Transcript

  • 1. Introduction to NoSQL 2011.02 Quang Nguyen
  • 2. Agenda (Communicating Knowledge)
      • New Challenges for RDBMS
      • Introduction to NoSQL
      • MongoDB Sharding
  • 3. Relational DBMS
      • In use since 1970
      • Uses SQL to manipulate data
      • Easy to use
      • Easy to integrate with other systems
      • Excellent for applications such as management (accounting, reservations, staff management, etc.)
  • 4. ACID Properties of RDBMS
      Databases always satisfy these four properties:
      • Atomic: "all or nothing"; when a transaction is executed, it either succeeds completely or fails completely
      • Consistent: data moves from one correct state to another correct state
      • Isolated: two concurrent transactions will not become entangled with each other
      • Durable: once a transaction has succeeded, the change will not be lost
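A minimal sketch of the "all or nothing" property, assuming Python's built-in sqlite3 module as a stand-in for an RDBMS. The account table, the account names A and B, and the transfer helper are hypothetical and not taken from the slides.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; undo everything if any step fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src))
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE name = ?", (src,)).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst))
        return True
    except ValueError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 3000), ("B", 2000)])
conn.commit()

ok = transfer(conn, "A", "B", 1000)      # both updates are applied
failed = transfer(conn, "A", "B", 5000)  # the debit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

The second call debits A before it notices the overdraft, yet the rollback leaves no trace of that partial update, which is the atomicity guarantee in miniature.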
  • 5. What is the problem with RDBMS?
      • Schemas aren't designed for sparse data
      • Normalization creates a lot of tables
      • Joins can be prohibitively expensive
      • Most importantly, relational databases are simply not designed to be distributed
  • 6. An Example of a Distributed DB
      • A banking system consisting of four branches in four different cities. Each branch maintains its accounts locally: Account = (account-number, branch, balance)
      • One single site maintains information about the branches: Branch = (branch-name, city, assets)
  • 7. An Example of a Distributed DB
      • Diagram: a client asks a transaction coordinator to transfer $1000 from account A ($3000, at Bank A) to account B ($2000, at Bank B)
      • Clients want all-or-nothing transactions
      • The transfer either happens or does not happen at all
  • 8. An Example of a Distributed DB
      • Simple solution (diagram): the client sends "start" to the transaction coordinator; the coordinator tells bank A to apply A=A-1000 and bank B to apply B=B+1000, then reports "done"
      • What can go wrong?
          • A does not have enough money
          • B's account no longer exists
          • B has crashed
          • The coordinator crashes
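  • 9. An Example of a Distributed DB
      • Two-phase Commit Protocol (2PC), as drawn in the diagram:
          • The client sends "start" to the transaction coordinator
          • The coordinator sends "prepare" to bank A and bank B; each locks its account and replies with its vote (rA, rB)
          • If rA==yes && rB==yes, outcome = "commit"; else outcome = "abort"
          • The coordinator sends the outcome to both banks; B commits upon receiving "commit"
          • The result is returned to the client
      • Cost: loss of availability and higher latency!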
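The two phases can be sketched in a few lines of Python. This is an in-memory simulation under simplifying assumptions (no network, no timeouts, no coordinator recovery); the Participant class, its methods, and two_phase_commit are hypothetical names chosen for illustration.

```python
class Participant:
    """A bank holding one account. Illustrative only."""

    def __init__(self, name, balance, alive=True):
        self.name, self.balance, self.alive = name, balance, alive
        self.pending = None  # change held while the account is locked

    def prepare(self, delta):
        # Phase 1: vote "yes" only if the change could be applied later.
        if not self.alive or self.balance + delta < 0:
            return False
        self.pending = delta
        return True

    def finish(self, outcome):
        # Phase 2: apply the held change on commit, discard it on abort.
        if outcome == "commit" and self.pending is not None:
            self.balance += self.pending
        self.pending = None


def two_phase_commit(changes):
    """changes: list of (participant, delta). Returns 'commit' or 'abort'."""
    votes = [p.prepare(delta) for p, delta in changes]
    outcome = "commit" if all(votes) else "abort"
    for p, _ in changes:
        p.finish(outcome)
    return outcome


bank_a = Participant("A", 3000)
bank_b = Participant("B", 2000)
first = two_phase_commit([(bank_a, -1000), (bank_b, +1000)])

bank_b.alive = False  # simulate "B has crashed"
second = two_phase_commit([(bank_a, -1000), (bank_b, +1000)])
```

Between prepare and finish each account is effectively locked, which is where the loss of availability and the extra latency come from.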
  • 10. Schemas vs. Schema-free
      • Relational schemas use tables to represent real objects
      • The join operation is expensive and difficult to execute in a horizontal scale-out
      • Relational table with columns Name, Surname, Home, Mobile, Telephone, Office, Marital-Status:
          • Quang, Nguyen: Mobile = 398; every other column is null
          • Cuong, Trinh: Home = Nguyen Dinh Chieu, Mobile = 999, Telephone = 555; every other column is null
      • The same data as schema-free documents:
          • user: { name: quang, surname: Nguyen, mobile: 398 }
          • user: { name: Cuong, surname: Trinh, Home: Nguyen Dinh Chieu, mobile: 999, Telephone: 555 }
  • 11. New Trends and Requirements
  • 12. Information amount is growing fast
      • In 2010, the amount of information created and replicated exceeded a zettabyte (a trillion gigabytes) for the first time. In 2011, it surpassed 1.8 zettabytes.
  • 13. Google: BigTable
      • Web Indexing
      • Google Earth
      • YouTube
      • Google Books
      • Google Mail
      High Scalability, High Availability
  • 14. Amazon: DynamoDB
      • RDBMS doesn't fit the requirements
      • Tens of thousands of servers around the world
      • 10 million customers
      High Reliability, High Availability
  • 15. Facebook: Cassandra, HBase
      • People
          • More than 800 million active users
          • More than 50% of active users log on to Facebook in any given day
          • The average user has 130 friends
      • Activity
          • More than 900 million objects that people interact with (pages, groups, events and community pages)
          • On average, more than 250 million photos are uploaded per day
          • The messaging system, including chat, wall posts, and email, handles 135+ billion messages per month
      High Scalability, High Availability
  • 16. Twitter
      High Availability
  • 17. CAP Theorem
      It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
      • Consistency: all nodes see the same data at the same time
      • Availability: every request receives a response about whether it succeeded or failed
      • Partition Tolerance: the system continues to operate despite arbitrary message loss
      You can guarantee only two of the three. In almost all cases, you would choose availability over consistency.
  • 18. Consistency Level
      • Strong (Sequential): after the update completes, any subsequent access will return the updated value.
      • Weak (weaker than Sequential): the system does not guarantee that subsequent accesses will return the updated value.
      • Eventual: all updates will propagate throughout all of the replicas in a distributed system, but this may take some time. Eventually, all replicas will be consistent.
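A toy illustration of eventual consistency, assuming an in-memory cluster in which a write lands on one replica first and reaches the others only when propagate() runs. The Cluster class and its method names are hypothetical; real systems propagate in the background and must also resolve conflicting writes.

```python
class Cluster:
    """N replicas; writes hit replica 0 and spread to the rest later."""

    def __init__(self, replica_count):
        self.replicas = [{} for _ in range(replica_count)]
        self.backlog = []  # updates not yet applied on every replica

    def write(self, key, value):
        self.replicas[0][key] = value
        self.backlog.append((key, value))

    def read(self, key, replica):
        return self.replicas[replica].get(key)

    def propagate(self):
        # Apply queued updates, in order, to the remaining replicas.
        for key, value in self.backlog:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.backlog.clear()


cluster = Cluster(3)
cluster.write("mobile", 398)
stale = cluster.read("mobile", replica=2)   # not propagated yet
cluster.propagate()
fresh = cluster.read("mobile", replica=2)   # now visible everywhere
```

A reader that hits replica 2 before propagation sees no value at all, which is the window that weak and eventual consistency allow and strong consistency forbids.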
  • 19. What is NoSQL
      • Stands for Not Only SQL
      • A class of non-relational data storage systems
      • Usually do not require a fixed table schema, nor do they use the concept of joins
      • All NoSQL offerings relax one or more of the ACID properties
      NoSQL != [image]
  • 20. What is NoSQL
  • 21. NoSQL Features
      • Key/value stores, or "the big hash table"
          • Amazon S3 (Dynamo)
          • Memcached
      • Schema-less, which comes in multiple flavors
          • Document-based (MongoDB, CouchDB)
          • Column-based (Cassandra, HBase)
          • Graph-based (Neo4j)
  • 22. Key/Value
      • Advantages
          • Very fast
          • Very scalable
          • Simple model
          • Able to distribute horizontally
      • Disadvantages
          • Many data structures (objects) can't be easily modeled as key/value pairs
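One reason key/value stores distribute horizontally so easily is that the key alone can decide which node owns a value. The sketch below assumes simple hash-modulo placement over plain dicts; KeyValueCluster is a hypothetical name, and production systems usually prefer consistent hashing so that adding a node relocates fewer keys.

```python
import hashlib

class KeyValueCluster:
    """A 'big hash table' spread over several in-memory nodes."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash of the key picks the owning node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.nodes[int(digest, 16) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        return self._node_for(key).get(key, default)


kv = KeyValueCluster(node_count=4)
kv.put("quang", {"surname": "Nguyen", "mobile": 398})
kv.put("cuong", {"surname": "Trinh", "mobile": 999})
found = kv.get("quang")
missing = kv.get("thanh")
```

Every operation touches exactly one node and needs no join or cross-node coordination, which is also why richer objects are awkward to model this way.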
  • 23. Schema-less
      • Advantages
          • The schema-less data model is richer than key/value pairs
          • Eventual consistency
          • Many are distributed
          • Still provide excellent performance and scalability
      • Disadvantages
          • No ACID transactions
  • 24. Memcached
  • 26. Introduction to MongoDB
      • MongoDB is a document-oriented database
      • Key -> Document
      • Structured document
      • Schema-free
      • Key = quang: user: { name: quang, surname: Nguyen, mobile: 398 }
      • Key = cuong: user: { name: Cuong, surname: Trinh, Home: Nguyen Dinh Chieu, mobile: 999, Telephone: 555 }
  • 27. Introduction to MongoDB
      • Query = quang, result count: 1
          • user: { name: quang, surname: Nguyen, mobile: 398 }
      • Query = cuong, result count: 1
          • user: { name: Cuong, surname: Trinh, Home: Nguyen Dinh Chieu, mobile: 999, Telephone: 555 }
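The document model can be mimicked with plain Python dictionaries. This is an in-memory stand-in, not the MongoDB API: the find helper is hypothetical and only does exact-match filtering on top-level fields.

```python
# Two documents in one "collection"; they do not share the same fields.
users = [
    {"name": "quang", "surname": "Nguyen", "mobile": 398},
    {"name": "Cuong", "surname": "Trinh", "Home": "Nguyen Dinh Chieu",
     "mobile": 999, "Telephone": 555},
]

def find(collection, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]

by_name = find(users, name="quang")
by_phone = find(users, Telephone=555)   # a field only one document has
no_match = find(users, Office="HCMC")   # a field no document has
```

No schema change is needed to store the second user's extra fields, and querying a field that some documents lack simply skips those documents.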
  • 28. Features of MongoDB
      • Indexing
      • Stored JavaScript
      • Aggregation
      • File storage
      • Makes scaling out easier
          • Scaling out vs. scaling up
          • Scaling out is done automatically, balanced across a cluster
  • 29. Some applications of MongoDB
      • Large-scale applications
      • Archiving and event logging
      • Document and content management systems
      Examples:
      • foursquare uses MongoDB to store venues and user "check-ins" into venues, sharding the data over more than 25 machines on Amazon EC2
      • Craigslist uses MongoDB to archive billions of records
      • Disney built a common set of tools and APIs for all games within the Interactive Media Group, using MongoDB as a common object repository to persist state information
  • 31. Introduction to Cassandra
      • Column family: a logical division that associates similar data, e.g., a User column family or a Hotel column family
      • Row-oriented: unlike in a relational model, a row doesn't need to have all the same columns as the other rows like it
      • Schema-free
  • 32. Introduction to Cassandra
      • Query = quang, result count: 3
          • (column=name, value=quang, timestamp=32345632)
          • (column=surname, value=Nguyen, timestamp=12345678)
          • (column=mobile, value=398, timestamp=33592839)
      • Query = cuong, result count: 5
          • (column=name, value=Cuong, timestamp=33434343)
          • (column=surname, value=Trinh, timestamp=34568258)
          • (column=Home, value=Nguyen Dinh Chieu, timestamp=54542368)
          • (column=mobile, value=999, timestamp=23445486)
          • (column=Telephone, value=555, timestamp=34314642)
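Each column in the result carries its own timestamp. The slides do not say what the timestamp is for; the sketch below assumes the usual "last write wins" rule, in which the value with the newest timestamp is kept for each column. The put_column helper and the values 111 and 399 are made up for illustration.

```python
def put_column(row, name, value, timestamp):
    """Store (value, timestamp) unless the row already holds a newer write."""
    current = row.get(name)
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)

# A row is a dict of column name -> (value, timestamp); rows may differ in columns.
quang = {}
put_column(quang, "name", "quang", 32345632)
put_column(quang, "mobile", "398", 33592839)
put_column(quang, "mobile", "111", 10000000)  # older write arriving late: ignored
put_column(quang, "mobile", "399", 40000000)  # newer write: replaces 398
```

Because the decision depends only on timestamps, replicas that receive the same writes in different orders still end up with the same row.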
  • 33. Features of Cassandra
      • Distributed and decentralized
          • In many other distributed systems, some nodes need to be set up as masters in order to organize other nodes, which are set up as slaves. In Cassandra every node is identical.
          • As a result, there is no single point of failure
      • High availability and fault tolerance
          • You can replace failed nodes in the cluster with no downtime, and you can replicate data to multiple data centers to offer improved local performance and prevent downtime if one data center experiences a catastrophe such as fire or flood
      • Tunable consistency
          • It allows you to easily decide the level of consistency you require, in balance with the level of availability
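Tunable consistency is often explained with the general quorum rule: with N replicas, a read that waits for R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. The rule and the function name below are not from the slides; they are included as a hedged illustration of the trade-off.

```python
def read_sees_latest_write(n_replicas, write_acks, read_acks):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_acks + write_acks > n_replicas

# Three replicas, three ways to tune the trade-off.
quorum_both = read_sees_latest_write(3, write_acks=2, read_acks=2)
one_and_one = read_sees_latest_write(3, write_acks=1, read_acks=1)
write_all = read_sees_latest_write(3, write_acks=3, read_acks=1)
```

Lower values of R and W answer faster and survive more node failures, at the price of possibly stale reads.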
  • 34. Features of Cassandra
      • Elastic scalability
          • Elastic scalability refers to a special property of horizontal scalability. It means that your cluster can seamlessly scale up and scale back down.
  • 35. Some Applications of Cassandra
      • Large deployments
      • Lots of writes, statistics, and analysis
      • Geographical distribution
      Examples:
      • Facebook used Cassandra to power Inbox Search, with over 200 nodes deployed
      • Twitter announced it is planning to use Cassandra because it can be run on large server clusters and is capable of taking in very large amounts of data at a time
      • AppScale uses Cassandra as a back-end for Google App Engine applications
  • 37. Neo4j – Graph Database
      • Data is stored as a graph/network
      • Nodes and relationships with properties
      • Schema-free
      • Nodes in the example graph:
          • people: { name: quang, surname: Nguyen }
          • people: { name: Cuong, surname: Trinh, hobbies: uncountable }
          • people: { name: Thanh, surname: Nguyen }
          • Company: { name: Saltlux Vietnam, Area: SearchEngine }
          • Company: { name: TechMaster, area: IT Education, founded: 2011 }
          • Company: { name: Fami, area: Furniture }
      • Relationship types in the example graph: KNOWS, WORKS, OWNS
  • 38. Neo4j – Graph Database
      • Find all persons who know a friend who knows someone called "Larry Ellison":
        SELECT ?person WHERE { ?person neo4j:KNOWS ?friend . ?friend neo4j:KNOWS ?foe . ?foe neo4j:name "Larry Ellison" . }
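The same two-hop pattern can be expressed over a plain adjacency map. This is a Python illustration of the traversal, not Neo4j's API; the KNOWS edges below are invented, since the slides do not state who knows whom.

```python
# Hypothetical KNOWS relationships: person -> people they know.
knows = {
    "quang": ["Cuong"],
    "Cuong": ["Larry Ellison", "Thanh"],
    "Thanh": ["quang"],
}

def knows_friend_of(knows, target):
    """People who KNOW some friend who in turn KNOWS `target`."""
    return sorted(
        person
        for person, friends in knows.items()
        if any(target in knows.get(friend, []) for friend in friends)
    )

result = knows_friend_of(knows, "Larry Ellison")
```

The traversal follows relationships directly from node to node, which is why this kind of query suits a graph store better than repeated self-joins on a relational table.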
  • 39. Features of Neo4j
      • Disk-based
      • Fully transactional like a real database (ACID is satisfied)
      • Scale-up, massive scalability: Neo4j can handle graphs of several billion nodes/relationships/properties on a single machine
      • No sharding
  • 40. Some Applications of Neo4j
      • Ideal for any application that relies on the relationships between records
          • Social networks
          • Recommendations
  • 41. MongoDB Sharding
  • 42. Some Considerations
      • What if you want to store a large volume of data, or access it at a higher rate than a single server can handle?
      • When more servers are added, what are the dependencies between servers?
      • Can your application cope if one server or a subset of servers crashes?
      • What if communication between servers has problems?
  • 43. What is sharding
      • Sharding is the method MongoDB uses to split a large collection across several servers (called a cluster)
      • MongoDB does almost everything automatically; it lets your application grow easily, robustly, and naturally
          • Making the cluster "invisible"
          • Making the cluster always available for reads and writes
          • Letting the cluster grow easily
  • 44. A Shard
      • A shard is one or more servers in a cluster that are responsible for some subset of the data
      • A shard can consist of many servers. If there is more than one server in a shard, each server has an identical copy of the subset of the data
      • Figure: a shard whose servers each hold the same data, "abc"
  • 45. Distributing Data – One range per shard
      • One range per shard:
          • Shard 1: ["a", "f"); Shard 2: ["f", "n"); Shard 3: ["n", "t"); Shard 4: ["t", "{")
      • Data movement issue: moving the range ["c", "f") from Shard 1 to Shard 2 changes the layout to
          • Shard 1: ["a", "c"); Shard 2: ["c", "n"); Shard 3: ["n", "t"); Shard 4: ["t", "{")
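With one contiguous range per shard, routing a key is a binary search over the range boundaries. A minimal sketch using the ranges from the slide and Python's bisect module; shard_for is a hypothetical helper, and "{" works as the upper bound because it is the ASCII character right after "z".

```python
import bisect

# Lower bounds of ["a","f"), ["f","n"), ["n","t"), ["t","{") in order.
lower_bounds = ["a", "f", "n", "t"]
shards = ["Shard 1", "Shard 2", "Shard 3", "Shard 4"]

def shard_for(key):
    """Return the shard whose half-open range contains `key`."""
    if not ("a" <= key < "{"):
        raise KeyError(key)
    # bisect_right counts the lower bounds that are <= key.
    return shards[bisect.bisect_right(lower_bounds, key) - 1]

routes = {name: shard_for(name) for name in ["cuong", "f", "quang", "thanh"]}
```

Moving the boundary between two shards only means changing one entry in lower_bounds, but every document inside the moved range then has to be copied to its new shard.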
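  • 46. Distributing Data – One range per shard
      • Data has to be moved across the cluster (sizes listed for Shards 1 to 4):
          • Start: 500 GB, 500 GB, 300 GB, 300 GB
          • Move 100 GB from Shard 1 to Shard 2: 400 GB, 600 GB, 300 GB, 300 GB
          • Move 200 GB from Shard 2 to Shard 3: 400 GB, 400 GB, 500 GB, 300 GB
          • Move 100 GB from Shard 3 to Shard 4: 400 GB, 400 GB, 400 GB, 400 GB
      • Total data movement: 400 GB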
  • 47. Distributing Data – One range per shard
      • It's worse when a new shard is added (sizes listed for Shards 1 to 5):
          • Start: 500 GB, 500 GB, 500 GB, 500 GB, 0 GB
          • Moves between neighbouring shards: 100 GB, 200 GB, 300 GB, 400 GB
          • End: 400 GB on each of the five shards
      • Total data movement: 1 TB
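The totals in these two examples follow from a simple cascade: with one contiguous range per shard, surplus data can only be pushed to the neighbouring shard, so it is moved again and again on its way down the line. A small sketch of that arithmetic; cascade_moves is a hypothetical helper and assumes the total divides evenly across the shards.

```python
def cascade_moves(sizes_gb):
    """Return (moves, total_moved) needed to even out neighbouring shards.

    Each move is (from_shard, to_shard, gigabytes), with shards numbered from 1.
    """
    target = sum(sizes_gb) // len(sizes_gb)
    moves, carry = [], 0
    for index, size in enumerate(sizes_gb[:-1]):
        # Whatever this shard holds above the target is pushed to the next one.
        carry = size + carry - target
        moves.append((index + 1, index + 2, carry))
    return moves, sum(abs(amount) for _, _, amount in moves)

four_moves, four_total = cascade_moves([500, 500, 300, 300])
five_moves, five_total = cascade_moves([500, 500, 500, 500, 0])
```

In the second case only 400 GB actually needs to end up on the new shard, yet 1000 GB crosses the network, which motivates letting a shard hold several ranges.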
  • 48. Distributing Data – Multi range shards
      • Each shard can contain multiple ranges. Each range of data is called a chunk.
      • Start:
          • Shard 1: 500 GB, ["a", "f"); Shard 2: 500 GB, ["f", "n"); Shard 3: 300 GB, ["n", "t"); Shard 4: 300 GB, ["t", "{")
      • Move the 100 GB chunk ["d", "f") from Shard 1 to Shard 3, and the 100 GB chunk ["j", "n") from Shard 2 to Shard 4
      • End:
          • Shard 1: 400 GB, ["a", "d")
          • Shard 2: 400 GB, ["f", "j")
          • Shard 3: 400 GB, ["n", "t"); ["d", "f")
          • Shard 4: 400 GB, ["t", "{"); ["j", "n")
  • 49. Sharding a collection
      • A key (the shard key) is used to define chunk ranges. The shard key can be of any type; values of different types are ordered as null < numbers < strings < objects < arrays < binary data < ObjectIds < booleans < dates < regular expressions
      • MongoDB first creates a single (-∞, +∞) chunk for a collection
      • As more data is added, MongoDB splits existing chunks to create new ones
      • Every chunk range must be distinct and must not overlap any other chunk range
      • Because data movement is resource-consuming, a chunk is only 200 MB by default
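Chunk splitting can be sketched as cutting a half-open range at a key that lies strictly inside it, so the two halves never overlap. This is an illustration of the idea rather than MongoDB's actual split logic; split_chunk and the choice of the middle distinct key as the split point are assumptions.

```python
import math

def split_chunk(chunk, keys):
    """Split the half-open range chunk=(low, high) at a middle key.

    Returns two non-overlapping chunks, or the original chunk unchanged
    when all of its keys share a single value and no split point exists.
    """
    low, high = chunk
    distinct = sorted({k for k in keys if low <= k < high})
    if len(distinct) < 2:
        return [chunk]
    middle = distinct[len(distinct) // 2]  # always greater than distinct[0]
    return [(low, middle), (middle, high)]

whole = (-math.inf, math.inf)              # the initial chunk of a collection
halves = split_chunk(whole, [5, 1, 9, 3])
stuck = split_chunk(whole, [7, 7, 7])      # one key value only: cannot split
```

The second call shows why a shard key with very few distinct values is a problem: a chunk whose documents all share one key value can only keep growing.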
  • 50. Balancing
      • MongoDB automatically moves chunks from one shard to another in order to
          • keep the data evenly distributed and
          • minimize the data movement
      • Before the balancer moves chunks, a shard must have at least 9 more chunks than the least populous shard
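A simplified sketch of a threshold-based balancer: chunks move one at a time from the fullest shard to the emptiest one, but only while the gap is at least the threshold of nine chunks mentioned on the slide. The balance function and the stopping rule are assumptions for illustration; the real balancer's behaviour once triggered may differ.

```python
def balance(chunk_counts, threshold=9):
    """Return (new_counts, moves) after threshold-triggered balancing."""
    counts = dict(chunk_counts)
    moves = []
    while True:
        most = max(counts, key=counts.get)
        least = min(counts, key=counts.get)
        if counts[most] - counts[least] < threshold:
            return counts, moves
        # Move a single chunk; small moves keep each transfer cheap.
        counts[most] -= 1
        counts[least] += 1
        moves.append((most, least))

uneven, uneven_moves = balance({"shard1": 20, "shard2": 5})
close, close_moves = balance({"shard1": 10, "shard2": 5})
```

A high threshold deliberately tolerates some imbalance so that chunks are not shuffled back and forth for marginal gains.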
  • 51. Choose a Sharding Key
      • Avoid a low-cardinality sharding key
          • Example: a continent value of "Asia", "Australia", "Europe", "North America", or "South America"
          • MongoDB can't split these chunks any further! The chunks will just keep getting bigger and bigger.
      • An ascending key does not work as well as we might expect
          • Example: using a timestamp as the sharding key
          • Everything is added to the last chunk
  • 52. Choose a Sharding Key
      • Random shard key
          • Waste of index
      • So, we want to choose a shard key with nice data locality, but not so local that we end up with a hot spot.
  • 53. When to shard
      • In general, you should start with a non-sharded setup and convert it to a sharded one if and when you need to, for example when you:
          • Run out of disk space on your current machine
          • Want to write data faster than a single process can handle
          • Want to keep a larger proportion of data in memory to improve performance
  • 54. Thank you!