NoSQL Exploration: Key/Value Stores

Contents

  • NoSQL
  • Why NoSQL?
  • NoSQL Categories
  • Relational vs. NoSQL Databases
  • Why Key/Value Store?
  • Memcached (key/value store in memory)
  • Memcachedb (key/value store on disk)
  • BerkeleyDB
  • Document Stores
  • Other Info
NoSQL

NOT only SQL. It's not about saying that SQL should never be used, or that SQL is dead; it's about recognizing that for some problems, other storage solutions are better suited.

Why NoSQL?

Trends that gave rise to the NoSQL paradigm:

  • Exploding data size – Each year more and more digital data is created. Every two years we create more digital data than all the data created in history before that.
  • Increasing connectedness – Over time, data has evolved to be more and more interlinked and connected: hypertext has links, blogs have pingbacks, and tagging groups related data together.
  • Semi-structured data – Individualization of content, storing more about each entity, and the acceleration of decentralized content generation (Web 2.0).
  • Architecture – A move toward decoupled services, each with its own backend.

Sources:
http://www.slideshare.net/novelys/nosql-3272395
http://www.slideshare.net/marin_dimitrov/nosql-databases-3584443
http://www.slideshare.net/thobe/nosql-for-dummies

NoSQL Categories

NoSQL Products

Relational vs. NoSQL Databases
Key/Value Store

Why Key/Value Store?

Even though RDBMSs have provided database users with the best mix of simplicity, robustness, flexibility, performance, scalability, and compatibility, their performance in each of these areas is not necessarily better than that of an alternative solution pursuing one of these benefits in isolation. This has not been much of a problem so far, because the universal dominance of the RDBMS has outweighed the need to push any of these boundaries. Nonetheless, if you really had a need that couldn't be answered by a generic relational database, alternatives have always been around to fill those niches.
Today, we are in a slightly different situation. For an increasing number of applications, one of these benefits is becoming more and more critical; and while still considered a niche requirement, it is rapidly becoming mainstream, so much so that for an increasing number of database users it is beginning to eclipse the others in importance. That benefit is scalability.

As more and more applications are launched in environments with massive workloads, such as web services, their scalability requirements can, first of all, change very quickly and, secondly, grow very large. The first scenario can be difficult to manage if you have a relational database sitting on a single in-house server: if your load triples overnight, how quickly can you upgrade your hardware? The second scenario can be too difficult to manage with a relational database in general.

Relational databases scale well, but usually only when that scaling happens on a single server node. When the capacity of that single node is reached, you need to scale out and distribute the load across multiple server nodes. This is when the complexity of relational databases starts to rub against their potential to scale. Try scaling to hundreds or thousands of nodes, rather than a few, and the complexities become overwhelming; the characteristics that make RDBMSs so appealing drastically reduce their viability as platforms for large distributed systems.

For cloud services to be viable, vendors have had to address this limitation, because a cloud platform without a scalable data store is not much of a platform at all. So, to provide customers with a scalable place to store application data, vendors had only one real option: implement a new type of database system that focuses on scalability, at the expense of the other benefits that come with relational databases.
These efforts, combined with those of existing niche vendors, have led to the rise of a new breed of database management system.

Source: http://www.slideshare.net/marc.seeger/keyvalue-stores-a-practical-overview

Memcached (Key/Value Store in Memory)

Definition

Memcached is a free and open-source, high-performance, distributed memory object caching system, generic in nature but intended for use in speeding up dynamic web applications by alleviating database load. It is an in-memory key/value store for small chunks of arbitrary data (strings, objects) from the results of database calls, API calls, or page rendering.
Memcached is simple yet powerful. Its simple design promotes quick deployment and ease of development, and it solves many problems facing large data caches. Its API is available for most popular languages.

What Is It Made Up Of?

  • Client software, which is given a list of available memcached servers.
  • A client-based hashing algorithm, which chooses a server based on the "key" input.
  • Server software, which stores your values with their keys in an internal hash table.
  • Server algorithms, which determine when to throw out old data (if memory runs out) or reuse memory.

What Are the Design Philosophies?

Simple Key/Value Store

The server does not care what your data looks like. Items are made up of a key, an expiration time, optional flags, and raw data. It does not understand data structures; you must upload data that is pre-serialized. Some commands (incr/decr) may operate on the underlying data, but the implementation is simplistic.

Smarts Half in Client, Half in Server

A "memcached implementation" is implemented partially in a client and partially in a server. Clients understand how to send items to particular servers, what to do when a server cannot be contacted, and how to fetch keys from the servers. The servers understand how to receive items and how to expire them.

Servers Are Disconnected From Each Other

Memcached servers are generally unaware of each other. There is no crosstalk, no synchronization, no broadcasting. The lack of interconnections means adding more servers will usually add more capacity, as you would expect. There may be exceptions to this rule, but they are exceptions and carefully regarded.

O(1) Everything

Wherever possible, memcached commands are O(1). Each command takes roughly the same amount of time to process every time, and should not get noticeably slower anywhere.
This goes back to the "simple key/value store" principle: you don't want to be processing data in a cache service that your tens, hundreds, or thousands of web servers may need to access at the same time.

Forgetting Data Is a Feature

Memcached is, by default, a least-recently-used cache, and it is designed to have items expire after a specified amount of time. Both of these are elegant solutions to many problems: expire items after a minute to limit stale data being returned, or flush unused data in an effort to retain frequently requested information.
This further allows great simplification in how memcached works: there are no "pauses" waiting for a garbage collector, which ensures low latency, and free space is lazily reclaimed.

Cache Invalidation Is a Hard Problem

Given memcached's centralized-as-a-cluster nature, the job of invalidating a cache entry is trivial. Instead of broadcasting invalidations to all available hosts, clients go directly to the exact location of the data to be invalidated. You may complicate matters further to fit your needs, and there are caveats, but you start from a strong baseline.

Architecture

The system uses a client–server architecture. The servers maintain a key–value associative array; the clients populate this array and query it. Keys are up to 250 bytes long and values can be at most 1 megabyte in size.

Clients use client-side libraries to contact the servers, which, by default, expose their service on port 11211. Each client knows all servers; the servers do not communicate with each other. If a client wishes to set or read the value corresponding to a certain key, the client's library first computes a hash of the key to determine the server that will be used, and then contacts that server. The server computes a second hash of the key to determine where to store or read the corresponding value.

The servers keep the values in RAM; if a server runs out of RAM, it discards the oldest values. Therefore, clients must treat Memcached as a transitory cache; they cannot assume that data stored in Memcached is still there when they need it. A Memcached-protocol-compatible product known as MemcacheDB provides persistent storage. There is also a solution called Membase from NorthScale that provides persistence, replication, and clustering.

If all client libraries use the same hashing algorithm to determine servers, then clients can read each other's cached data; this is obviously desirable. A typical deployment has several servers and many clients.
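The two hash steps described above (a client-side hash to pick a server, then a server-side hash to place the value) can be sketched as follows. The server list, the MD5 choice, and the simple modulo placement are illustrative assumptions; real memcached clients often use consistent hashing instead:

```python
import hashlib

# Hypothetical server pool; in practice each client is configured with this list.
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def pick_server(key: str) -> str:
    # First hash, computed client-side: maps the key onto one server.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

# Every client configured with the same server list picks the same server
# for a given key -- this is why clients can read each other's cached data.
chosen = pick_server("user:42")
```

Because the mapping is deterministic, no coordination between servers is needed; the routing intelligence lives entirely in the clients.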
However, it is possible to use Memcached on a single computer, acting simultaneously as client and server.

http://memcached.org/

How Does This Stuff Work? a.k.a. "The Memcache Pattern"
(http://code.google.com/appengine/docs/python/memcache/usingmemcache.html#Pattern)

Memcache is typically used with the following pattern:

  • The application receives a query from the user or the application.
  • The application checks whether the data needed to satisfy that query is in memcache.
      o If the data is in memcache, the application uses that data.
      o If the data is not in memcache, the application queries the datastore and stores the results in memcache for future requests.

The pseudocode below represents a typical memcache request:

```python
def get_data():
    data = memcache.get("key")
    if data is not None:
        return data
    else:
        data = self.query_for_data()
        memcache.add("key", data, 60)
        return data
```

Memcached allows you to take memory from parts of your system where you have more than you need and make it accessible to areas where you have less than you need. It also allows you to make better use of your memory. If you consider the diagram to the right, you can see two deployment scenarios:

1. Each node is completely independent (top).
2. Each node can make use of memory from other nodes (bottom).

The first scenario illustrates the classic deployment strategy. It is wasteful in two ways: the total cache size is a fraction of the actual capacity of your web farm, and considerable effort is required to keep the cache consistent across all of those nodes. With memcached, all of the servers look into the same virtual pool of memory. This means that a given item is always stored in, and always retrieved from, the same location in your entire web cluster.

Also, as the demand for your application grows to the point where you need more servers, the data that must be regularly accessed generally grows too. A deployment strategy where these two aspects of your system scale together just makes sense.
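A runnable version of the cache-aside pattern, with a plain dict standing in for the memcache cluster and a hypothetical datastore lookup (both are illustrative stand-ins, not a real client API):

```python
cache = {}                               # stands in for the memcache cluster
DB = {"key": "value-from-datastore"}     # stands in for the datastore
db_hits = 0                              # counts falls-through to the datastore

def query_for_data(key):
    global db_hits
    db_hits += 1
    return DB[key]

def get_data(key):
    data = cache.get(key)
    if data is not None:
        return data                      # cache hit: no datastore work
    data = query_for_data(key)           # cache miss: query the datastore
    cache[key] = data                    # real code would also set an expiry, e.g. 60 s
    return data

first = get_data("key")    # misses the cache, hits the datastore
second = get_data("key")   # served from the cache
print(first, second, db_hits)   # value-from-datastore value-from-datastore 1
```

The point of the pattern is visible in `db_hits`: however many times the data is requested, the datastore is only queried once per expiry window.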
The illustration to the right only shows two web servers for simplicity, but the property remains the same as the number increases. If you had fifty web servers, you'd still have a usable cache size of 64 MB in the first example, but in the second, you'd have 3.2 GB of usable cache. Of course, you aren't required to use your web servers' memory for cache; many memcached users have dedicated machines that are built to be memcached servers only.

Users of Memcached

LiveJournal, Wikipedia, Flickr, Bebo, Twitter, Typepad, Yellowbot, YouTube, Digg, Wordpress, Craigslist, Mixi

Memcachedb (Key/Value Store on Disk)

Definition (Wiki: http://en.wikipedia.org/wiki/Memcachedb)

MemcacheDB is a persistence-enabled variant of memcached, the general-purpose distributed memory caching system often used to speed up dynamic database-driven websites by caching data and objects in memory. The main difference between MemcacheDB and memcached is that MemcacheDB has its own key/value database system based on Berkeley DB, so it is meant for persistent storage rather than as a cache solution. MemcacheDB is accessed through the same protocol as memcached, so applications may use any memcached API as a means of accessing a MemcacheDB database. MemcacheQ is a MemcacheDB variant that provides a simple message queue service.

MemcacheDB is a distributed key/value storage system designed for persistence. It is NOT a cache solution, but a persistent storage engine for fast and reliable key/value-based object storage and retrieval. It conforms to the memcache protocol, so any memcached client can connect to it. MemcacheDB uses Berkeley DB as its storage backend, so many features, including transactions and replication, are supported.

Memcached was first developed by Brad Fitzpatrick for his website LiveJournal, on May 22, 2003.

Features

  • High-performance reads and writes for key/value-based objects. Rapid set/get for key/value-based objects, not relational ones. The benchmarks below tell the story.
  • Highly reliable persistent storage with transactions. Transactions are used to make your data more reliable.
  • Highly available data storage with replication. Replication rocks! Achieve your HA, spread your reads, make your transactions durable!
  • Memcache protocol compatibility. Lots of memcached client APIs can be used with Memcachedb, in almost any language: Perl, C, Python, Java, ...

Why Memcachedb?

We have MySQL, we have PostgreSQL, we have a lot of RDBMSs; so why do we need Memcachedb?

  • An RDBMS is slow. They all have a complicated SQL engine on top of the storage, and our data needs to be stored and retrieved extremely fast.
  • An RDBMS doesn't handle concurrency well when there are thousands of clients and millions of requests.
  • The data we want to store is very small, so the cost is high if we use an RDBMS.
  • Many critical infrastructure services need fast, reliable data storage and retrieval, but do not need the flexibility of dynamic SQL queries:
      o Indexes, counters, flags
      o Identity management (accounts, profiles, user config info, scores)
      o Messaging
      o Personal domain names
      o Metadata of distributed systems
      o Other non-relational data

Performance

Benchmark: MemcacheDB is very fast.

Environment:
  • Box: Dell 2950III
  • OS: Linux CentOS 5
  • Version: memcachedb-1.0.0-beta
  • Client API: libmemcached

a. Non-Threaded Edition

Started with: memcachedb -d -r -u root -H /data1/mdbtest/ -N -v

Write (key: 16 B, value: 100 B, 8 concurrent processes, 2,000,000 sets each):

  No.       1    2    3    4    5    6    7    8    avg.
  Cost (s)  807  835  840  853  859  857  865  868  848

  2,000,000 * 8 / 848 = 18,868 writes/s

Read (key: 16 B, value: 100 B, 8 concurrent processes, 2,000,000 gets each):

  No.       1    2    3    4    5    6    7    8    avg.
  Cost (s)  354  354  359  358  357  364  363  365  360

  2,000,000 * 8 / 360 = 44,444 reads/s

b. Threaded Edition (4 Threads)
Started with: memcachedb -d -r -u root -H /data1/mdbtest/ -N -t 4 -v

Write (key: 16 B, value: 100 B, 8 concurrent processes, 2,000,000 sets each):

  No.       1    2    3    4    5    6    7    8    avg.
  Cost (s)  663  669  680  680  684  683  687  686  679

  2,000,000 * 8 / 679 = 23,564 writes/s

Read (key: 16 B, value: 100 B, 8 concurrent processes, 2,000,000 gets each):

  No.       1    2    3    4    5    6    7    8    avg.
  Cost (s)  245  249  250  248  248  249  251  250  249

  2,000,000 * 8 / 249 = 64,257 reads/s

How Does This Stuff Work?

Source: http://memcachedb.org/ and http://memcachedb.org/memcachedb-guide-1.0.pdf

Non-threaded version

Threaded version
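Because MemcacheDB conforms to the memcache text protocol, you can see exactly what any memcached client sends over the wire. A sketch that frames the set and get commands by hand (the helper names are made up for illustration; the wire format is the standard memcached text protocol):

```python
def frame_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    # memcached text protocol: set <key> <flags> <exptime> <bytes>\r\n<data>\r\n
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode()
    return header + value + b"\r\n"

def frame_get(key: str) -> bytes:
    # get <key>\r\n -- the server replies with
    # VALUE <key> <flags> <bytes>\r\n<data>\r\nEND\r\n
    return f"get {key}\r\n".encode()

set_cmd = frame_set("user:42", b"alice")
get_cmd = frame_get("user:42")
print(set_cmd)   # b'set user:42 0 0 5\r\nalice\r\n'
```

The same bytes work against memcached and MemcacheDB alike; only what the server does with them (RAM cache vs. Berkeley DB on disk) differs.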
BerkeleyDB (Persistent Storage Used by Memcachedb)

Source: http://www.oracle.com/technology/products/berkeley-db/db/index.html

Oracle Berkeley DB is a high-performance embeddable database providing SQL, Java object, and key/value storage. Berkeley DB offers advanced features including transactional data storage, highly concurrent access, replication for high availability, and fault tolerance, in a self-contained, small-footprint software library. Berkeley DB enables the development of custom data management solutions without the overhead traditionally associated with such custom projects. It provides a collection of well-proven building-block technologies that can be configured to address any application need, from the handheld device to the datacenter, from a local storage solution to a worldwide distributed one, from kilobytes to petabytes.

Berkeley DB can be downloaded and its source code reviewed; you then choose your build options and compile the library in the configuration most suitable for your needs. The Berkeley DB library is a building block that provides the complex data management features found in enterprise-class databases. These
facilities include high throughput, low-latency reads, non-blocking writes, high concurrency, data scalability, in-memory caching, ACID transactions, automatic and catastrophic recovery when the application, system, or hardware fails, and high availability and replication, in an application-configurable package. Simply configure the library and use the particular features available to satisfy your particular application needs.

Oracle Berkeley DB fits where you need it regardless of programming language, hardware platform, or storage media. Berkeley DB APIs are available in almost all programming languages, including ANSI C, C++, Java, C#, Perl, Python, Ruby, and Erlang, to name a few. There is a pure-Java version of the Berkeley DB library designed for products that must run entirely within a Java Virtual Machine (JVM). The Microsoft .NET environment and the Common Language Runtime (CLR) are supported with a C# API. Oracle Berkeley DB is tested and certified to compile and run on all modern operating systems, including Solaris, Windows, Linux, Android, Mac OS X, BSD, iPhone OS, VxWorks, and QNX, to name a few.

Storage Engine Design

BerkeleyDB:
  • Written in C
  • Software library
  • Key/value API
  • SQL API by incorporating SQLite
  • BTREE, HASH, QUEUE, RECNO storage
  • Java Direct Persistence Layer (DPL) API
  • Java Collections API
  • C++, Java/JNI, C#, Python, Perl, ...
  • Replication for high availability

BerkeleyDB Java Ed.:
  • Written in Java
  • Java software archive (JAR)
  • Key/value API
  • Java Direct Persistence Layer (DPL) API
  • Java Collections API
  • Replication for high availability

BerkeleyDB XML:
  • Written in C++
  • Software library
  • Layered on Berkeley DB
  • XQuery API by incorporating XQilla
  • Indexed, optimized XML storage
  • C++, Java/JNI, C#, Python, Perl, ...
  • Replication for high availability

Use Cases of BerkeleyDB

  • Amazon's Dynamo – http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
  • BerkeleyDB Java Ed. on Android – http://www.oracle.com/technetwork/database/berkeleydb/bdb-je-android-160932.pdf
  • Infoflex Connect AB Embeds Critical Edge into High-Speed, High-Performance SMS Messaging Gateway – http://www.oracle.com/customers/snapshots/infoflex-connect-database-snapshot.pdf
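Berkeley DB itself is not in Python's standard library, but the stdlib dbm module follows the same open/put/get, on-disk key/value pattern, so it gives a rough feel for the programming model. A sketch of the idea, not the actual Berkeley DB API:

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "store")

# Writes go straight to files on disk, keyed and valued by bytes.
with dbm.open(path, "c") as db:      # "c" = create the database if missing
    db[b"user:42"] = b"alice"
    db[b"user:43"] = b"bob"

# Reopening the file shows the data persisted across the close --
# the defining difference between a disk store and a RAM cache.
with dbm.open(path, "r") as db:
    value = db[b"user:42"]
    count = len(db.keys())

print(value, count)   # b'alice' 2
```

Berkeley DB layers transactions, replication, and its BTREE/HASH/QUEUE/RECNO access methods on top of this basic model, which is exactly what MemcacheDB relies on for durability.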
Document Stores
Other Info

http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores
http://ayende.com/Blog/category/565.aspx
http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
