  • Machines have upper limits on capacity
  • Increased complexity of SQL: more bugs, because developers have to write more complicated SQL to handle sharding logic.
  • Single point of failure: corruption of one shard due to network, hardware, or system problems causes failure of the entire table.
  • Failover servers more complex: failover servers must themselves hold copies of the fleets of database shards.
  • Backups more complex: backups of the individual shards must be coordinated with the backups of the other shards.
  • Operational complexity added: adding or removing indexes, adding or deleting columns, and modifying the schema become much more difficult.
  • Requirements for distributed systems
  • A Survey of Advanced Non-relational Database Systems: Approaches and Applications

    1. A Survey of Advanced Non-relational Database Systems: Approaches and Applications
       Speaker: LIN Qian
       http://www.comp.nus.edu.sg/~linqian
    2. Outline
       • Introduction
       • Non-relational database system
         – Requirement
         – Concepts
         – Approaches
         – Optimization
         – Examples
       • Comparison between RDBMS and non-relational database system
    3. Problem
       • The Web introduces a new scale for applications, in terms of:
         – Concurrent users (millions of requests per second)
         – Data (petabytes generated daily)
         – Processing (all this data needs processing)
         – Exponential growth (surging, unpredictable demand)
       • Shortcomings of existing RDBMSs
         – Oracle, MS SQL, Sybase, MySQL, PostgreSQL, …
         – Trouble when dealing with very large traffic
         – Even with their high-end clustering solutions
    4. Problem
       • Why?
         – Applications using a normalized database schema require joins, which don't perform well with large data volumes and/or many nodes
         – Existing RDBMS clustering solutions rely on scale-up, which is limited and not really scalable when dealing with exponential growth (e.g., 1000+ nodes)
         – Machines have upper limits on capacity
    5. Problem
       • Why not just use sharding?
         – Very complex and application-specific
           • Increased complexity of SQL
           • Single point of failure
           • Failover servers more complex
           • Backups more complex
           • Operational complexity added
         – Very problematic when adding/removing nodes
         – Basically, you end up denormalizing everything and losing all the benefits of relational databases
       Sharding: split one or more tables by row across potentially multiple instances of the schema and database servers.
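To make the "problematic when adding/removing nodes" point concrete, here is a minimal sketch in Python. It assumes a hypothetical hash-modulo placement rule (`shard_for`) and made-up row keys; the slides do not prescribe any particular sharding scheme. It measures how many rows would have to move when a fifth shard is added to a four-shard setup.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by hashing the row key (hypothetical placement rule)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Made-up row keys, for illustration only.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_shards=4, new_shards=5)
print(f"{fraction:.0%} of rows move when going from 4 to 5 shards")
```

With this placement rule a key stays put only when its hash gives the same remainder modulo 4 and modulo 5, so the large majority of rows have to be copied to a different server.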
    6. Who faced this problem?
       • Web applications dealing with high traffic and massive data
         – Web service providers
           • Google, Yahoo!, Amazon, Facebook, Twitter, LinkedIn, …
         – Scientific data analysis
           • Weather, ocean, tide, geothermy, …
         – Complex information processing
           • Financial, stock, telecommunication, …
    7. Solution
       • A new kind of DBMS, capable of handling web scale
         – Possibly sacrificing some features
       • CAP theorem*: you can optimize for only 2 of these 3
         – Consistency: the system is in a consistent state after an operation
           • All nodes see the same data at the same time
           • Strong consistency (ACID) vs. eventual consistency (BASE)
         – Availability: the system is “always on”, with no downtime
           • Node failure tolerance: all clients can find some available replica
           • Software/hardware upgrade tolerance
         – Partition tolerance
           • The system continues to operate (read/write) despite arbitrary message loss or failure of part of the system
       * Eric A. Brewer, Towards Robust Distributed Systems, Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (PODC), 2000
    8. Non-relational database systems
       • Various solutions & products
         – BigTable, LevelDB (developed at Google)
         – HBase (started at Powerset, now an Apache project)
         – Dynamo (developed at Amazon)
         – Cassandra (developed at Facebook)
         – Voldemort (developed at LinkedIn)
         – Riak, Redis, CouchDB, MongoDB, Berkeley DB, …
       • Research systems
         – NoDB, Walnut, LogBase, Albatross, Citrusleaf, HadoopDB
         – PIQL, RAMCloud
    9. Benefits
       • Massively scalable
       • Extremely fast
       • Highly available, decentralized and fault tolerant
         – No single point of failure
       • Transparent sharding (consistent hashing)
       • Elasticity
       • Parallel processing
       • Dynamic schema
       • Automatic conflict resolution
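The "transparent sharding (consistent hashing)" bullet can be illustrated with a small hash ring. This is a sketch under stated assumptions, not the implementation of any product named in the slides: the class name `HashRing`, the node names, the use of MD5, and the choice of 100 virtual nodes per server are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a point on the ring (MD5 chosen only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper class)."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each physical node is placed at `vnodes` points on the ring.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        """Return the owner of `key`: the first ring point clockwise from its hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(10_000)]
before = HashRing(["node-a", "node-b", "node-c", "node-d"])
after = HashRing(["node-a", "node-b", "node-c", "node-d", "node-e"])

moved_keys = [k for k in keys if before.node_for(k) != after.node_for(k)]
ring_fraction = len(moved_keys) / len(keys)
print(f"{ring_fraction:.0%} of keys move when node-e joins")
```

Only keys that fall into the ranges claimed by the new node change owner, and they all move to that new node, which is what lets the store add or remove servers without application-level resharding logic.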
    10. Cost
       • Allows sacrificing consistency (ACID)
         – In certain circumstances, though applications can cope with it
       • Non-standard new API model
       • Non-standard new schema model
       • New knowledge required to tune/optimize
       • Less mature
    11. Data/API/Schema model
       • Data model: key-value store
         – (row:string, column:string, time:int64) → string
         – An opaque serialized object
       • API model
         – Get(key)
         – Put(key, value)
         – Delete(key)
         – Execute(operation, key_list)
       • Schema model
         – None
         – Kind of sparse table
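The data and API model above can be sketched as a tiny in-memory store. This is an illustration only: the class `KeyValueStore`, the choice of `(row, column)` tuples as keys, and the optional `timestamp` parameter are assumptions made here to show how the `(row, column, time) → string` mapping and the four operations fit together; real systems differ in detail.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Key = Tuple[str, str]  # (row, column); the timestamp is the third dimension


class KeyValueStore:
    """Toy versioned key-value store (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        # Each cell keeps a list of (timestamp, value) versions, oldest first.
        self._cells: Dict[Key, List[Tuple[int, str]]] = {}

    def put(self, key: Key, value: str, timestamp: Optional[int] = None) -> None:
        ts = time.time_ns() if timestamp is None else timestamp
        versions = self._cells.setdefault(key, [])
        versions.append((ts, value))
        versions.sort(key=lambda version: version[0])

    def get(self, key: Key, timestamp: Optional[int] = None) -> Optional[str]:
        """Return the newest value, or the newest one at or before `timestamp`."""
        versions = self._cells.get(key, [])
        visible = [v for ts, v in versions if timestamp is None or ts <= timestamp]
        return visible[-1] if visible else None

    def delete(self, key: Key) -> None:
        self._cells.pop(key, None)

    def execute(self, operation: Callable[[Key, Optional[str]], object],
                key_list: Iterable[Key]) -> List[object]:
        """Apply `operation` to each key and its current value."""
        return [operation(key, self.get(key)) for key in key_list]


store = KeyValueStore()
title = ("com.example/index", "title")
store.put(title, "Old title", timestamp=1)
store.put(title, "New title", timestamp=2)
print(store.get(title))               # newest version
print(store.get(title, timestamp=1))  # version visible at time 1
lengths = store.execute(lambda key, value: len(value or ""), [title])
```

Because values are opaque strings and columns are created simply by writing to them, no schema has to be declared up front, which is the "kind of sparse table" idea.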
    12. Data processing
       • MapReduce*
         – An API exposed by non-relational databases to process data
         – A functional programming pattern for parallelizing work
         – Brings the workers to the data
           • Excellent fit for non-relational databases
         – Reduces the programming to 2 simple functions
           • map & reduce
       * Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004.
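As an illustration of "2 simple functions", here is a single-process word count in Python. The driver `map_reduce` and the names `map_fn` and `reduce_fn` are hypothetical; a real MapReduce runtime would distribute the map and reduce calls across the machines that hold the data, which this sketch does not attempt.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all counts emitted for one word."""
    return word, sum(counts)


def map_reduce(records, mapper, reducer):
    """Sequential stand-in for the runtime: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(key, values) for key, values in groups.items())


docs = [("d1", "to be or not to be"), ("d2", "to scale out")]
word_counts = map_reduce(docs, map_fn, reduce_fn)
print(word_counts["to"])  # 3
```

The programmer writes only `map_fn` and `reduce_fn`; the grouping step in the middle is what the framework provides.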
    13. Optimization: Distributed indexing
       • Exploits the characteristics of Cayley graphs to provide the scalability needed to support multiple distributed indexes of different types.
       • Defines a methodology to map various types of data and P2P overlays to a generalized Cayley graph structure.
       • Proposes self-tuning strategies to optimize the performance of the indexes defined over the generic Cayley overlay.
    14. Optimization: Data migration
       • Albatross is a technique for live migration in a multitenant database which can migrate a live tenant database with no aborted transactions.
         – Phase 1: Begin Migration.
         – Phase 2: Iterative Copying.
         – Phase 3: Atomic Handover.
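The three phases can be pictured with a toy simulation. This is not the Albatross algorithm itself, whose details are not given in the slides; it only shows the general shape of snapshot, iterative delta copying, and a final switch. The class `TenantNode`, the function `migrate`, and the `threshold` parameter are hypothetical, and concurrent client writes are simulated as batches that arrive between copy rounds.

```python
class TenantNode:
    """Toy server holding one tenant's data (hypothetical)."""

    def __init__(self, data=None, serving=False):
        self.data = dict(data or {})
        self.dirty = set()       # keys written since the last copy round
        self.serving = serving   # whether this node accepts client writes

    def write(self, key, value):
        if not self.serving:
            raise RuntimeError("node is not serving this tenant")
        self.data[key] = value
        self.dirty.add(key)


def migrate(source, destination, concurrent_writes, threshold=1):
    """Copy a tenant from source to destination while source keeps serving."""
    # Phase 1: begin migration with a full snapshot copy.
    source.dirty.clear()
    destination.data = dict(source.data)

    # Phase 2: iterative copying of whatever changed during the previous round.
    rounds = 0
    for batch in concurrent_writes:
        for key, value in batch:
            source.write(key, value)          # source is still live
        changed, source.dirty = source.dirty, set()
        for key in changed:
            destination.data[key] = source.data[key]
        rounds += 1
        if len(changed) <= threshold:         # delta is small enough to hand over
            break

    # Phase 3: atomic handover. Stop writes, ship the last delta, switch owner.
    source.serving = False
    for key in source.dirty:
        destination.data[key] = source.data[key]
    source.dirty.clear()
    destination.serving = True
    return rounds


src = TenantNode({"a": 1}, serving=True)
dst = TenantNode()
batches = [[("a", 2), ("b", 3), ("c", 4)], [("a", 5)]]
rounds = migrate(src, dst, batches)
print(rounds, dst.data)
```

The point of the iterative phase is that each round has less to copy than the one before, so the window during which writes are paused in the final phase stays short.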
    15. Example: Oracle Berkeley DB
       • High-performance embeddable database providing SQL, Java Object and Key-Value storage
         – Relational storage: supports SQL.
         – Synchronization: extends the reach of existing applications to mobile devices with a high-performance, robust data store on the device.
         – Replication: provides a single-master, multi-replica, highly available database configuration.
       [Figure: storage engine]
    16. Example: Amazon DynamoDB
       • Fully managed NoSQL database service providing fast and predictable performance with seamless scalability
         – Provisioned throughput
           • Allocates dedicated resources to a table to meet its performance requirements, and automatically partitions data over a sufficient number of servers to meet request capacity.
         – Consistency model
           • The eventual consistency option maximizes read throughput.
         – Data model
           • Attributes, items and tables
    17. Example: HBase
       • Non-relational, distributed database running on top of HDFS, providing Bigtable-like capabilities for Hadoop
         – Strongly consistent reads/writes
         – Automatic sharding
         – Hadoop/HDFS integration
         – Block cache and Bloom filters
         – Operational management
    18. Example: CouchDB
       • Scalable, fault-tolerant, and schema-free document-oriented database
         – Document storage
         – Distributed architecture with replication
         – Map/Reduce views and indexes
         – ACID semantics
         – Eventual consistency
         – Built for offline
    19. Example: Riak
       • A distributed database architected for availability, fault-tolerance, operational simplicity and scalability.
         – Operates in highly distributed environments
         – Scales simply and intelligently
         – Master-less
         – Highly fault-tolerant
         – Incredibly stable
    20. Example: MongoDB
       • Document-oriented NoSQL database system
         – Scales horizontally without compromising functionality
         – Document-oriented storage
         – Full index support
         – Master-slave replication
         – Rich, document-based queries
    21. Comparison with RDBMS
       • Transaction
         – Web apps can (usually) do without transactions / strong consistency / integrity
         – Bigtable does not support transactions across multiple rows
           • Supports single-row transactions
           • Provides an interface for batching writes across row keys at the clients
       • Scalability
         – Parallel DBMS vs. MapReduce-based systems
    22. THANK YOU!
    23. Backup
    24. Example of the CAP theorem
       • When you have a lot of data which needs to be highly available, you'll usually need to partition it across machines and also replicate it to be more fault-tolerant
       • This means that when writing a record, all replicas must be updated too
       • Now you need to choose between:
         – Lock all relevant replicas during update => be less available
         – Don't lock the replicas => be less consistent
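The choice above can be shown with a toy model of three replicas, one of which is unreachable. Everything here is hypothetical (the `Replica` class and the two write functions); it is a sketch of the trade-off, not of any particular replication protocol.

```python
class Replica:
    """Toy replica holding a single value (hypothetical)."""

    def __init__(self):
        self.value = None
        self.reachable = True


def write_locked(replicas, value) -> bool:
    """Consistency first: update every replica or refuse the write."""
    if not all(r.reachable for r in replicas):
        return False  # the write is rejected, so the system is less available
    for r in replicas:
        r.value = value
    return True


def write_unlocked(replicas, value) -> bool:
    """Availability first: update whichever replicas can be reached."""
    updated = False
    for r in replicas:
        if r.reachable:
            r.value = value
            updated = True
    return updated  # unreachable replicas keep stale data


replicas = [Replica(), Replica(), Replica()]
write_locked(replicas, "v1")        # all replicas reachable: succeeds
replicas[2].reachable = False       # simulate a network partition

locked_ok = write_locked(replicas, "v2")
unlocked_ok = write_unlocked(replicas, "v2")
values = [r.value for r in replicas]
print(locked_ok, unlocked_ok, values)
```

During the partition the locked write is refused while the unlocked write succeeds, leaving the third replica with the old value until it is repaired later, which is the eventual-consistency side of the trade-off.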