NoSql And The Semantic Web


Published in: Technology, Education

NoSQL and the Semantic/Social Web

Irina Hutanu
Alexandru Ioan Cuza University, Computer Science, Computational Linguistics (2nd year)
Faculty of Letters graduate
{Irina Hutanu,}

Abstract. NoSQL is a new and promising approach to storing and managing information at web scale. “Not only SQL”[5], as many define it, is spreading rapidly because of its non-relational design, which makes it easier to distribute data horizontally. In what follows we try to clarify what this newborn movement stands for.

1 Introduction

A NoSQL database can handle large amounts of information thanks to a few features that increase its storage capacity:

  • Consistency requirements are relaxed. According to the CAP theorem, a distributed system cannot guarantee consistency, availability and partition tolerance all at the same time.
  • Key/value storage: a deliberately simple way of storing data.
  • It runs on a large number of machines, with the information replicated and partitioned among them.

Some of the most important and highly rated database systems that work in this manner are Google Bigtable, HBase, Hypertable, Amazon Dynamo, Voldemort, Cassandra, Riak, CouchDB, MongoDB and Redis.

Data-driven sites such as Google and Facebook work with terabytes of information that must be scaled and partitioned quickly and efficiently. These Internet giants also use tens of thousands of servers located all around the world. Consequently, failures occur constantly, yet the service must stay “always-on”. Any minor problem that occurs while a customer or user queries the database, and that cuts them off from the information they are after, may lead to serious financial loss. Such risks cannot be ignored, which is why systems like Dynamo and Bigtable emerged. Their non-relational architecture, incremental scalability and decentralized character offer a robust data storage system.

2 Architecture

2.1 Partitioning Process

One important feature of a NoSQL system is that it has to scale incrementally. To do this quickly and consistently, Dynamo, for example, uses virtual nodes in the partitioning process. Each physical node is mapped not to a single position but to several, so a non-uniform distribution of data and load is no longer a problem. In addition, if a node becomes unreachable or disappears because of a system failure, the load it handled is spread across the other, properly working nodes.
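
The idea of virtual nodes can be illustrated with a small consistent-hashing sketch. This is not Dynamo's implementation: the class name `HashRing`, the choice of MD5 for ring positions, the label format `node#vnodeN` and the default of 100 virtual nodes per machine are all assumptions made for illustration.

```python
import bisect
import hashlib
from collections import Counter


def ring_position(label: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(label.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Hypothetical hash ring where each physical node owns many positions."""

    def __init__(self, vnodes_per_node: int = 100):
        self.vnodes_per_node = vnodes_per_node
        self._positions = []  # sorted ring positions of all virtual nodes
        self._owner = {}      # ring position -> physical node name

    def add_node(self, node: str) -> None:
        # One physical node is placed at several positions (its virtual nodes).
        for i in range(self.vnodes_per_node):
            pos = ring_position(f"{node}#vnode{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node: str) -> None:
        # Dropping a node removes all of its virtual nodes from the ring.
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        # A key belongs to the first virtual node clockwise from its position.
        if not self._positions:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._owner[self._positions[idx % len(self._positions)]]


ring = HashRing()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)

keys = [f"user:{i}" for i in range(3000)]
print(Counter(ring.lookup(k) for k in keys))  # load per node before removal
ring.remove_node("node-b")
print(Counter(ring.lookup(k) for k in keys))  # node-b's keys move to the others
```

In this sketch, keys that were not stored on the removed node keep their owner, and the removed node's keys are shared among the remaining nodes instead of all landing on a single neighbour.
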

Bigtable, another non-relational storage system, partitions and organizes its data differently. Being “a sparse, distributed, persistent multi-dimensional sorted map”[1], it indexes data by row, column and timestamp. Partitioning is dynamic and is applied to ranges of rows.

2.2 Replication

Continuous data availability is also ensured by replication. These systems generally replicate the information they store on multiple hosts in order to avoid data loss and to offer durability. Bigtable, for instance, is used with a replication process that copies data across different clusters, which reduces latency and protects the data against loss: “The Personalized Search data is replicated across several Bigtable clusters to increase availability and to reduce latency due to distance from clients. The Personalized Search team originally built a client-side replication mechanism on top of Bigtable that ensured eventual consistency of all replicas. The current system now uses a replication subsystem that is built into the servers.”[1]

Fig. 1. Partitioning and Replication in Dynamo.¹

2.3 Consistency versus Availability

If multiple versions of the same data exist, they must be reconciled to avoid failures caused by conflicting data. Unfortunately, in a system that trades consistency for availability, divergent versions cannot always be reconciled automatically. Dynamo, for example, uses vector clocks to determine whether one version of an object supersedes another or whether two or more versions have diverged. In some cases this method cannot collapse the divergent versions by itself, and semantic reconciliation is used instead. However, this approach adds overhead to the system, so it is reserved for the cases that require it.
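
The comparison that vector clocks make possible can be sketched as follows. This is a minimal illustration, not Dynamo's code: the function names are hypothetical, a clock is assumed to be a plain dictionary from node name to counter, and the node names `Sx`, `Sy` and `Sz` are placeholders.

```python
from typing import Dict

VectorClock = Dict[str, int]  # assumed representation: node name -> counter


def increment(clock: VectorClock, node: str) -> VectorClock:
    """Return a copy of the clock after `node` handles one more write."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: VectorClock, b: VectorClock) -> bool:
    """True if every counter in b is matched or exceeded in a."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: VectorClock, b: VectorClock) -> str:
    """Classify two versions as equal, ordered, or concurrent."""
    a_over_b, b_over_a = descends(a, b), descends(b, a)
    if a_over_b and b_over_a:
        return "equal"
    if a_over_b:
        return "a supersedes b"
    if b_over_a:
        return "b supersedes a"
    return "concurrent"  # divergent versions: semantic reconciliation needed


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise maximum, used once divergent versions are reconciled."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


d1 = increment({}, "Sx")   # first write, handled by Sx
d2 = increment(d1, "Sx")   # second write, also handled by Sx
d3 = increment(d2, "Sy")   # update of d2 handled by Sy
d4 = increment(d2, "Sz")   # another update of d2, handled by Sz
d5 = increment(merge(d3, d4), "Sx")  # reconciled version written through Sx

print(compare(d2, d1))  # a supersedes b
print(compare(d3, d4))  # concurrent
print(d5 == {"Sx": 3, "Sy": 1, "Sz": 1})  # True
```

Only the "concurrent" outcome calls for semantic reconciliation; when one version supersedes the other, the older one can simply be discarded.
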

With the exception of such issues, the choice of availability over consistency has produced some interesting and unexpected results and can, to some extent, be called a real success: “The production use of Dynamo for the past year demonstrates that decentralized techniques can be combined to provide a single highly-available system. Its success in one of the most challenging application environments shows that an eventual-consistent storage system can be a building block for highly-available applications.”[2]

¹ Image from Dynamo: Amazon’s Highly Available Key-value Store

Fig. 2. Version evolution of an object over time.²

2.4 Gossip Protocol

The gossip protocol is used both for propagating updates and for detecting failures. When a node joins or leaves the system, the change is communicated from node to node, which allows data to be reorganized among the functioning nodes. To this end, every second each node contacts a peer chosen at random, and the two synchronize their histories of membership changes.

Failure detection relies on the same gossip-style exchange. A node is considered unavailable if it does not respond to another node’s messages. The latter node then obtains the required information from a different node and periodically retries the first one to check whether it has recovered. Detection is therefore decentralized: there is no central authority that points out defective nodes. Instead, a gossip process lets each node “hear” about the arrival or departure of other nodes: “Dynamo adopts a full membership model where each node is aware of the data hosted by its peers. To do this, each node actively gossips the full routing table with other nodes in the system. This model works well for a system that contains couple of hundreds of nodes.”[2]

² Image from Dynamo: Amazon’s Highly Available Key-value Store
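
The way membership changes spread by gossip can be simulated in a few lines. This is a simplified sketch under stated assumptions, not Dynamo's protocol: the `Node` class, the per-member version counter (used here in place of timestamps) and the in-memory "rounds" standing in for the once-per-second exchange are all hypothetical.

```python
import random


class Node:
    """Hypothetical node holding its own view of cluster membership."""

    def __init__(self, name: str):
        self.name = name
        # member name -> (version, status); the higher version wins on merge
        self.view = {name: (1, "joined")}

    def record_change(self, member: str, status: str) -> None:
        """Record a membership change that was issued at this node."""
        version, _ = self.view.get(member, (0, None))
        self.view[member] = (version + 1, status)

    def reconcile_with(self, peer: "Node") -> None:
        """Both nodes end up with the newest entry known to either of them."""
        merged = dict(self.view)
        for member, entry in peer.view.items():
            if member not in merged or entry[0] > merged[member][0]:
                merged[member] = entry
        self.view = dict(merged)
        peer.view = dict(merged)


def gossip_round(nodes, rng: random.Random) -> None:
    """One round: every node reconciles with one randomly chosen peer."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.reconcile_with(peer)


def rounds_until_converged(nodes, rng: random.Random, max_rounds: int = 100):
    """Gossip until all views agree; return the round count, or None."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if all(n.view == nodes[0].view for n in nodes):
            return round_no
    return None


rng = random.Random(42)  # fixed seed so the simulation is repeatable
cluster = [Node(f"node-{i}") for i in range(20)]
print("rounds to learn all members:", rounds_until_converged(cluster, rng))

# A removal is issued at one node only, then spreads to the rest by gossip.
cluster[0].record_change("node-19", "removed")
remaining = cluster[:-1]
print("rounds to spread the removal:", rounds_until_converged(remaining, rng))
```

No node acts as a coordinator here: each one only talks to a random peer, and the change still reaches every view, which is the decentralized behaviour described above.
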

Fig. 3. Gossip-style process.³

3 Final Remarks

A relatively new movement in the storage domain, NoSQL succeeds in dethroning classical SQL systems built on relational, centralized information processing. Today’s web requires coordinating, manipulating and gathering vast quantities of data and knowledge. As a result, traditional database applications seem to have lost ground to non-relational systems that avoid join operations and fixed schemas and, to some extent, even give up the ACID guarantees by offering processes that are only “eventually consistent”[3].

4 References

[1] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November 2006.
[2] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, Dynamo: Amazon’s Highly Available Key-value Store, 2007.
[3] Werner Vogels, Eventually Consistent - Revisited, 2008.
[4] SQL Databases Don't Scale
[5]

³ Image from Pragmatic Programming Techniques