
A Survey of Advanced Non-relational Database Systems: Approaches and Applications



  1. A Survey of Advanced Non-relational Database Systems: Approaches and Applications
     Speaker: LIN Qian
  2. Outline
     • Introduction
     • Non-relational database systems
       – Requirements
       – Concepts
       – Approaches
       – Optimization
       – Examples
     • Comparison between RDBMS and non-relational database systems
  3. Problem
     • The Web introduces a new scale for applications, in terms of:
       – Concurrent users (millions of requests/second)
       – Data (petabytes generated daily)
       – Processing (all this data needs processing)
       – Exponential growth (surging, unpredictable demands)
     • Shortcomings of existing RDBMSs
       – Oracle, MS SQL, Sybase, MySQL, PostgreSQL, …
       – Trouble when dealing with very large traffic
       – Even with their high-end clustering solutions
  4. Problem
     • Why?
       – Applications using a normalized database schema require joins, which don't perform well over large amounts of data and/or many nodes
       – Existing RDBMS clustering solutions rely on scale-up, which is limited and not truly scalable in the face of exponential growth (e.g., 1000+ nodes)
       – Machines have upper limits on capacity
  5. Problem
     • Why not just use sharding?
       – Very complex and application-specific
         • Increased complexity of SQL
         • Single point of failure
         • More complex failover servers
         • More complex backups
         • Added operational complexity
       – Very problematic when adding/removing nodes
       – Basically, you end up denormalizing everything and losing all the benefits of relational databases
     Sharding: splitting one or more tables by row across potentially multiple instances of the schema and database servers.
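The pain of adding/removing nodes can be made concrete with a minimal sketch of naive hash-modulo sharding (illustrative only; real deployments route these shard numbers to separate database servers):

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Route a row to a shard by hashing its key modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_shards

# The rebalancing problem: growing from 4 to 5 shards remaps most keys,
# which is why adding/removing nodes is so painful with naive sharding.
keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys change shards")  # roughly 80%
```

Since nearly every key moves when the shard count changes, resharding means mass data migration, which motivates the consistent-hashing approach used by many non-relational systems.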
  6. Who faced this problem?
     • Web applications dealing with high traffic and massive data
       – Web service providers
         • Google, Yahoo!, Amazon, Facebook, Twitter, LinkedIn, …
       – Scientific data analysis
         • Weather, ocean, tide, geothermal, …
       – Complex information processing
         • Finance, stocks, telecommunications, …
  7. Solution
     • A new kind of DBMS, capable of handling web scale
       – Possibly sacrificing some features
     • CAP theorem*: a distributed system can guarantee at most 2 of these 3 properties
       – Consistency: the system is in a consistent state after an operation
         • All nodes see the same data at the same time
         • Strong consistency (ACID) vs. eventual consistency (BASE)
       – Availability: the system is "always on", with no downtime
         • Node failure tolerance: all clients can find some available replica
         • Software/hardware upgrade tolerance
       – Partition tolerance: the system continues to operate (read/write) despite arbitrary message loss or failure of part of the system
     * Eric A. Brewer, "Towards Robust Distributed Systems", Proceedings of the 19th Annual ACM Symposium on Principles of Distributed Computing (PODC), 2000
  8. Non-relational database systems
     • Various solutions & products
       – Bigtable, LevelDB (developed at Google)
       – HBase (Apache project modeled on Bigtable)
       – Dynamo (developed at Amazon)
       – Cassandra (developed at Facebook)
       – Voldemort (developed at LinkedIn)
       – Riak, Redis, CouchDB, MongoDB, Berkeley DB, …
     • Research systems
       – NoDB, Walnut, LogBase, Albatross, Citrusleaf, HadoopDB
       – PIQL, RAMCloud
  9. Benefits
     • Massively scalable
     • Extremely fast
     • Highly available, decentralized, and fault-tolerant – no single point of failure
     • Transparent sharding (consistent hashing)
     • Elasticity
     • Parallel processing
     • Dynamic schema
     • Automatic conflict resolution
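The "transparent sharding" benefit rests on consistent hashing, which can be sketched as a ring in a few lines (a minimal sketch; systems such as Dynamo or Riak add virtual nodes and replication on top of this idea):

```python
import bisect
import hashlib

class HashRing:
    """A minimal consistent-hashing ring."""

    def __init__(self, nodes):
        # Place every node at a fixed point on a circular hash space.
        self._ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest(), "big")

    def node_for(self, key: str) -> str:
        """Walk clockwise from the key's hash to the next node point."""
        points = [p for p, _ in self._ring]
        i = bisect.bisect_right(points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # always the same node for this key
```

Unlike naive modulo sharding, adding a node only remaps the keys that fall between it and its predecessor on the ring; every other key keeps its owner, which is what makes elastic scaling practical.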
  10. Cost
     • Sacrifices consistency (ACID) in some circumstances – applications must be designed to cope with this
     • Non-standard, new API model
     • Non-standard, new schema model
     • New knowledge required to tune/optimize
     • Less mature
  11. Data/API/Schema model
     • Data model: key-value store
       – (row:string, column:string, time:int64) → string
       – Values are opaque serialized objects
     • API model
       – Get(key)
       – Put(key, value)
       – Delete(key)
       – Execute(operation, key_list)
     • Schema model
       – None
       – Effectively a sparse table
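The four-call API above is small enough to sketch as an in-memory toy (real stores persist, partition, and replicate the data; the method names follow the slide):

```python
class KVStore:
    """An in-memory sketch of the key-value API above."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

    def delete(self, key):
        self._data.pop(key, None)

    def execute(self, operation, key_list):
        """Apply a caller-supplied function to each listed key's value."""
        return [operation(key, self._data.get(key)) for key in key_list]

store = KVStore()
# Bigtable-style composite key: (row, column, timestamp) -> opaque value
store.put(("user:1", "name", 1), b"alice")
```

Note that the store never inspects the value: it is an opaque blob, which is exactly what "schema-less" means at this layer.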
  12. Data processing
     • MapReduce*
       – An API exposed by non-relational databases to process data
       – A functional programming pattern for parallelizing work
       – Brings the workers to the data
         • An excellent fit for non-relational databases
       – Reduces the programming effort to 2 simple functions
         • map & reduce
     * Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI), 2004
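The two-function pattern can be shown with the classic word-count example, here as a single-process sketch (a real framework shards the map and reduce phases across workers and shuffles intermediate pairs between them):

```python
from collections import defaultdict

def map_phase(documents):
    """map: emit an intermediate (word, 1) pair per occurrence."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    """reduce: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:   # the framework's shuffle groups pairs by key
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(map_phase(docs)))
# → {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Everything the programmer writes is the two pure functions; partitioning, scheduling, and fault tolerance are the framework's job, which is why the pattern pairs so naturally with sharded key-value data.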
  13. Optimization: distributed indexing
     • Exploit the characteristics of Cayley graphs to scalably support multiple distributed indexes of different types
     • Define a methodology for mapping various types of data and P2P overlays onto a generalized Cayley graph structure
     • Propose self-tuning strategies to optimize the performance of indexes defined over the generic Cayley overlay
  14. Optimization: data migration
     • Albatross is a technique for live migration in a multitenant database; it can migrate a live tenant database with no aborted transactions
       – Phase 1: Begin migration
       – Phase 2: Iterative copying
       – Phase 3: Atomic handover
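The three phases can be sketched as a toy copy loop (assumed semantics for illustration only; the real system migrates database pages and transaction state between nodes while the tenant stays online):

```python
def migrate(source, dirty_batches):
    """Toy sketch of Albatross's three migration phases."""
    # Phase 1: Begin migration – take an initial snapshot of the tenant.
    destination = dict(source)
    # Phase 2: Iterative copying – re-copy keys written at the source
    # while it continues serving transactions.
    for batch in dirty_batches:
        for key in batch:
            destination[key] = source[key]
    # Phase 3: Atomic handover – briefly pause, copy the final delta,
    # and switch traffic to the destination (modeled as one last sync).
    destination.update(source)
    return destination

tenant = {"acct:1": 100, "acct:2": 50}
tenant["acct:1"] = 75                 # a write that lands mid-copy
dest = migrate(tenant, [["acct:1"]])  # destination still ends up current
```

Because each iterative pass shrinks the remaining delta, the final handover pause is short, which is how the technique avoids aborting in-flight transactions.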
  15. Example: Oracle Berkeley DB
     • High-performance embeddable database providing SQL, Java object, and key-value storage
       – Relational storage: supports SQL
       – Synchronization: extends existing applications to mobile devices with high performance and a robust on-device data store
       – Replication: provides a single-master, multi-replica, highly available database configuration
  16. Example: Amazon DynamoDB
     • Fully managed NoSQL database service providing fast and predictable performance with seamless scalability
       – Provisioned throughput
         • Allocates dedicated resources to a table to meet performance requirements, and automatically partitions data over enough servers to meet request capacity
       – Consistency model
         • The eventual-consistency option maximizes read throughput
       – Data model
         • Attributes, items, and tables
  17. Example: HBase
     • Non-relational, distributed database running on top of HDFS, providing Bigtable-like capabilities for Hadoop
       – Strongly consistent reads/writes
       – Automatic sharding
       – Hadoop/HDFS integration
       – Block cache and Bloom filters
       – Operational management tools
  18. Example: CouchDB
     • Scalable, fault-tolerant, schema-free document-oriented database
       – Document storage
       – Distributed architecture with replication
       – Map/reduce views and indexes
       – ACID semantics
       – Eventual consistency
       – Built for offline use
  19. Example: Riak
     • Distributed database architected for availability, fault tolerance, operational simplicity, and scalability
       – Operates in highly distributed environments
       – Scales simply and intelligently
       – Masterless
       – Highly fault-tolerant
       – Incredibly stable
  20. Example: MongoDB
     • Document-oriented NoSQL database system
       – Scales horizontally without compromising functionality
       – Document-oriented storage
       – Full index support
       – Master-slave replication
       – Rich, document-based queries
  21. Comparison with RDBMS
     • Transactions
       – Web apps can (usually) do without transactions / strong consistency / integrity
       – Bigtable does not support transactions across multiple rows
         • Supports single-row transactions
         • Provides an interface for batching writes across row keys at the clients
     • Scalability
       – Parallel DBMSs vs. MapReduce-based systems
  22. THANK YOU!
  23. Backup
  24. Example of the CAP theorem
     • When you have a lot of data that needs to be highly available, you'll usually need to partition it across machines and also replicate it to be more fault-tolerant
     • This means that when writing a record, all replicas must be updated too
     • Now you need to choose between:
       – Locking all relevant replicas during the update => being less available
       – Not locking the replicas => being less consistent
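The choice on the slide can be made concrete with a toy replication model (a sketch for illustration, not a real protocol; the `lock_all` flag is a stand-in for the lock/no-lock decision):

```python
class Replica:
    def __init__(self):
        self.value = None
        self.up = True

def write(replicas, value, lock_all):
    """Toy model of the consistency/availability trade-off.

    lock_all=True  -> choose consistency: refuse the write unless every
                      replica can be updated (less available).
    lock_all=False -> choose availability: update whichever replicas are
                      reachable and let the rest lag (less consistent).
    """
    up = [r for r in replicas if r.up]
    if lock_all and len(up) < len(replicas):
        return False        # a replica is unreachable: reject the write
    if not up:
        return False
    for r in up:
        r.value = value
    return True

replicas = [Replica(), Replica(), Replica()]
replicas[2].up = False                     # one replica is partitioned away

write(replicas, "v1", lock_all=True)       # rejected: stays consistent
write(replicas, "v1", lock_all=False)      # accepted: replica 3 is now stale
```

Either branch sacrifices something during the partition: the first refuses writes (unavailable), the second leaves a replica serving stale reads (inconsistent), which is the CAP trade-off in miniature.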