NoSQL initiative and its influences on social and
                        semantic Web

                               Stefan Prutianu, Stefan Ceriu
             Faculty of Computer Science, „Al. I. Cuza“ University, Iasi, Romania
                          { stefan.prutianu, stefan.ceriu}@info.uaic.ro



       Abstract. In this paper we describe NoSQL, a series of non-relational database
       technologies and products developed to address the problems that RDMS
       systems currently face: lack of true scalability, poor performance on high
       data volumes and low availability. Some of these products are already used
       in production and perform very well: Amazon’s Dynamo, Google’s Bigtable,
       Cassandra, etc. We also provide a view on how these systems influence
       application development in the social and semantic Web sphere.

       Keywords: NoSQL, distributed computing, distributed non-relational database,
       semantic Web, social Web, scalability




1 Introduction


   Modern relational database technologies have serious problems managing the huge
volumes of data found today (eBay: 2 PB of data overall [2]). These problems are
scalability, performance and rigid schema design. [1]
   Vertical scaling (increasing the computational power of a single node) is just a
temporary solution until the data grows again beyond the storage limit.
   Horizontal scaling in a traditional relational database management system
(partitioning, sharding) means dividing the data into multiple databases according to
some application-specific boundaries. Splitting the data across multiple servers
breaks the relationships stored within the database, the most valuable property of a
relational database, and it is also not transparent to the application’s business logic.
   Read slaves are a form of horizontal scaling used in RDMS (Relational Database
Management System): a read-only slave database replicates the master database, every
write is directed to the master and every read to one of the read slave replicas.
This is still not true scaling because of the single point of failure.
   Large relational databases (multiple terabytes or petabytes in size) usually perform
slowly on complex queries because of the amount of data they have to scan and
because their design is disk-oriented and disk operations are time consuming. [3]
   RDMS requires that the database schema (tables, columns, relationships) be
designed before the data is used. In most cases such a schema will later require
changes (adding new features, adjusting or fine tuning existing ones), but changing
the database schema is very hard in such systems: updating rows may lock them and is
a very time consuming operation. [12]
   NoSQL is the common name for a set of new technologies, design practices and
open-source projects which address the problems that large scale
distributed applications and platforms are facing: scalability, availability,
performance, fault tolerance. The NoSQL trend is not intended to replace the relational
database model; instead it proposes new solutions to problems that the traditional
database model cannot solve.
   This paper is structured as follows. Section 2 describes the NoSQL trend in detail
with its proposed solutions and results, Section 3 presents how NoSQL influenced
application development in the Social and Semantic Web sphere and Section 4 concludes
our survey.




2 NoSQL


2.1 Overview

NoSQL proponents became more visible in early 2009, when they
proposed distributed database solutions for systems where the
relational features of RDMS are not needed. Their inspiration came from
the closed-source distributed databases already in use at some large
corporations, such as Dynamo from Amazon and Bigtable from Google.
   These solutions, along with the open-source projects (Cassandra, Hypertable,
HBase, Redis), share a number of characteristics: they use key-value storage, they run
on a large number of machines, and their data is partitioned and shared among these
machines.
   Another common characteristic is that, in order to reach the desired level of
scalability, availability, performance and fault (partition) tolerance, the data
consistency requirement is relaxed. The reason is Eric Brewer’s CAP
Theorem, which states that a distributed system cannot provide Consistency,
Availability and Partition Tolerance at the same time [6]. Most of these systems
therefore achieve a particular form of weak consistency named eventual consistency.
   Consistency means that a system operates fully or not at all; in a distributed
environment, if an update is made on some node, all of its replicas are updated before
any read from those replicas is performed. Relational databases achieve consistency
because they focus on the ACID (Atomicity, Consistency, Isolation,
Durability) properties.
   Availability means that a system is always available to perform requested tasks.
   Partition Tolerance is the ability of a distributed system to work even in case of
partition forming – one or more nodes are isolated from the others due to
network/communication failures.
   Eventual consistency is a specific form of weak consistency: if no new updates are
made to an object, eventually all reads will return the last updated value. [7] DNS
(Domain Name System) is a system that implements eventual consistency.

   Dynamo. Amazon’s Dynamo is a highly available key-value structured storage
system [4]. It was developed to meet Amazon’s needs for reliability and scaling.
Access to data is provided through a primary-key interface (get(key), put(key) and
overloads of these operations). Scalability and availability are achieved through a
combination of techniques: consistent hashing is used for data partitioning and
replication, data consistency is facilitated by object versioning, consistency among
replicas during updates is maintained by a quorum technique and a decentralized
replica synchronization protocol, and a gossip-based protocol is used for failure
detection and membership updates. Amazon’s engineers motivated their design by three
observations: most of the services their platform exposes store and retrieve data by
primary key and thus do not require the complex querying and management functionality
of a RDMS; a RDMS is costly to maintain; and with traditional storage models
availability would be sacrificed in favor of consistency. Dynamo’s components for
request coordination, membership and failure detection, and the local persistence
engine are all implemented in Java. The local persistence component has a pluggable
design and uses engines such as BDB (Berkeley Database) Transactional Data Store,
BDB Java Edition, MySQL and an in-memory buffer with persistent backing storage. [4]

   Bigtable. Google’s Bigtable is a distributed storage system for managing
structured data, designed to be highly scalable. It has proven its efficiency
in important Google applications: Personalized Search, Google Analytics,
Google Earth, Google Finance. Bigtable does not support a full relational data model;
instead it provides clients with a simple data model indexed by row, column and
timestamp. From the data model point of view Bigtable is a sparse, distributed,
persistent, multi-dimensional sorted map where each value in the map is an
uninterpreted array of bytes. Row keys are arbitrary strings, data is
maintained in lexicographic order by row key, and every read or write under a single
row key is atomic regardless of the number of columns involved.
Columns are grouped in sets called column families, which usually contain
information of the same type. Timestamps are introduced because each cell can
contain multiple versions of the same data. The Bigtable API provides functions for
creating and deleting tables and column families, for reads and updates under a
particular key, and for other operations involving cluster management. A master
model is used to manage load balancing and fault tolerance. For internal persistence
Bigtable uses the SSTable file format (an immutable sorted file of key-value pairs)
in conjunction with GFS (Google File System). [5]
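
   The data model described above can be illustrated with a minimal in-memory
sketch. It is not Bigtable's implementation or API: the class SparseTable, its
method names and the sample row and column names are hypothetical, and the sketch
only shows a map indexed by row key, column and timestamp, with multiple versions
per cell and scans in lexicographic row order.

```python
class SparseTable:
    """Toy model of a sorted map: (row, column, timestamp) -> bytes.

    Hypothetical illustration only; names are not taken from any real API.
    """

    def __init__(self):
        # row key -> {column -> list of (timestamp, value), newest first}
        self._rows = {}

    def put(self, row, column, timestamp, value):
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((timestamp, value))
        # Keep the newest version first so reads can stop at the first match.
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest if None)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Yield (row, columns) for start_row <= row < end_row, in row-key order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                yield row, self._rows[row]


table = SparseTable()
table.put("com.cnn.www", "contents:", 3, b"<html>v3</html>")
table.put("com.cnn.www", "contents:", 5, b"<html>v5</html>")
print(table.get("com.cnn.www", "contents:"))               # newest version
print(table.get("com.cnn.www", "contents:", timestamp=4))  # version as of time 4
```

Only cells that were written occupy space, which is what "sparse" refers to in
this sketch; a row may hold any number of columns.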

2.2 Design patterns [8], [11]

API Model. Because the underlying data model can be considered a large
distributed hashtable (DHT), the basic API (Application Programming Interface) could
be:
- get(key) – extracts the value stored at the given key.
- put(key, value) – updates the value stored at the given key.
- delete(key) – removes the key and its associated value.
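
The three operations can be sketched as a thin wrapper over an in-memory
dictionary. This is a single-process illustration only; the class name
KeyValueStore is hypothetical, put takes the value as an explicit second argument,
and a real store would route each call to the node that owns the key.

```python
class KeyValueStore:
    """Single-node stand-in for the get/put/delete interface of a DHT."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def put(self, key, value):
        # Creates the key or overwrites its current value.
        self._data[key] = value

    def delete(self, key):
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1", b"alice")
print(store.get("user:1"))
store.delete("user:1")
print(store.get("user:1"))
```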

Machine Infrastructure. The infrastructure for this kind of system is composed of
a large number of machines with commodity hardware connected through a
network. Each machine (physical node) has the same software configuration, but the
hardware characteristics may differ. Within each physical node a
number of virtual nodes are running.

Partition Schemes. Most large scale distributed systems use a consistent hashing
technique because of its flexibility when the number of virtual nodes changes. When
nodes are added or removed, keys and data need to be redistributed, and consistent
hashing minimizes the amount of data that moves. In consistent hashing
the key space is finite: the output range of a hash function is treated as a
fixed ring. Both virtual node ids and data item keys take values in this circular space,
and the owner of a key is the first virtual
node encountered walking the ring clockwise from that key. If a virtual node
crashes, all the keys it owned are adopted by its clockwise
neighbor, so the rest of the virtual nodes on the ring are not affected.
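
A minimal sketch of such a ring follows. The choice of MD5 as the hash function,
the class name HashRing and the node names are assumptions made for illustration;
the sketch only demonstrates the ownership rule (first virtual node clockwise from
the key) and the fact that removing a node moves only that node's keys.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest(), "big")


class HashRing:
    def __init__(self):
        self._positions = []  # sorted ring positions of the virtual nodes
        self._owners = {}     # ring position -> virtual node id

    def add_node(self, node_id):
        pos = ring_position(node_id)
        bisect.insort(self._positions, pos)
        self._owners[pos] = node_id

    def remove_node(self, node_id):
        pos = ring_position(node_id)
        self._positions.remove(pos)
        del self._owners[pos]

    def owner(self, key):
        """Return the first virtual node at or after the key, wrapping around."""
        if not self._positions:
            raise LookupError("the ring has no nodes")
        idx = bisect.bisect_left(self._positions, ring_position(key))
        return self._owners[self._positions[idx % len(self._positions)]]


ring = HashRing()
for node in ["vnode-a", "vnode-b", "vnode-c", "vnode-d"]:
    ring.add_node(node)

keys = ["key-%d" % i for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.remove_node("vnode-b")
moved = [k for k in keys if ring.owner(k) != before[k]]
# Only keys that were owned by the removed node change owner.
print(all(before[k] == "vnode-b" for k in moved))
```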

Data Replication. In order to achieve high availability and performance the same data
needs to be available on multiple nodes (replicas). In Dynamo [4] the list of nodes
responsible for storing a particular key is called the preference list, and the size of
this list is configured by a preset parameter. While read actions can be performed on any
replica, update actions can lead to consistency issues because the updates need
to be propagated to all the replicas.
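
One way to derive a preference list from a hash ring is sketched below. The
helper names build_ring and preference_list, the number of virtual nodes per
machine and the use of MD5 are assumptions for illustration. The sketch walks
clockwise from the key and keeps the first n distinct physical nodes, so that
replicas do not land on two virtual nodes of the same machine.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string to a point on the ring (MD5 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest(), "big")


def build_ring(physical_nodes, vnodes_per_node=4):
    """Return a sorted list of (position, physical node), one entry per virtual node."""
    ring = []
    for node in physical_nodes:
        for i in range(vnodes_per_node):
            ring.append((ring_position("%s#%d" % (node, i)), node))
    ring.sort()
    return ring


def preference_list(ring, key, n):
    """Walk clockwise from the key and collect the first n distinct physical nodes."""
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_left(positions, ring_position(key))
    chosen = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in chosen:
            chosen.append(node)
            if len(chosen) == n:
                break
    return chosen


ring = build_ring(["A", "B", "C", "D"])
print(preference_list(ring, "user:42", 3))
```

The first entry of the returned list is the owner of the key; the remaining
entries hold its replicas.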

Data Models. The basic data access method is to use a key in order to retrieve or
update a value. The value can be a blob (binary large object) [4], a document, a column
family (rows and columns, where a row can have as few or as many columns as desired) [5],
a graph or a collection.

Storage Models. The most used strategy is to design this component in a pluggable
fashion where storage mechanisms can be: MySQL DB, Berkeley DB, filesystem -
SSTables, or in-memory storage – memtables.

Consistency Management. The same data is available on multiple nodes at a given
time, and the problem that arises is how to synchronize these replicas in order to
preserve a consistent view of the data from the client's perspective. In systems where
availability and partition tolerance are important requirements, strict consistency
cannot be achieved at the same time as these two properties (CAP Theorem), so a form
of weak consistency, eventual consistency, is implemented instead. There are
various mechanisms that guarantee such systems will eventually become
consistent after a period of time (the inconsistency window) during which
synchronization is performed.
Timestamps. The history of operations performed on a row of data can be used to
decide which value the row will eventually converge to. The drawbacks of this
method are: it requires synchronized clocks on the nodes, it does not capture
causality, and a decision is hard to take when write operations happen simultaneously.
Vector clocks. A vector clock is a tuple {t1, t2,…,tn} of clock values, one from each node.
When a write operation is performed on node i, it sets ti to its clock value. Given two
vector clocks v1 and v2, v1 < v2 holds if v1[k] ≤ v2[k] for all k and v1 ≠ v2; in
that case the event stamped v1 precedes the event stamped v2 in the global ordering
of events. Replicas follow these rules when updating their vector clocks:
     - when an internal operation happens at replica i it will advance its vector
          clock vi[i]
     - when replica i sends a message to replica j it also attaches its vector clock to
          the message
     - when replica j receives a message from replica i it will advance its clock
          vj[j] and then merge it with the vector clock received in the message: vj[k] =
          max(vi[k], vj[k]) for every k
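
The update rules can be sketched with vector clocks represented as plain lists
of integers. The function names below are hypothetical; the merge step takes the
element-wise maximum of the local clock of the receiving replica and the clock
carried by the message.

```python
def increment(clock, i):
    """Internal operation at replica i: advance its own component."""
    advanced = list(clock)
    advanced[i] += 1
    return advanced


def merge(a, b):
    """Element-wise maximum of two vector clocks."""
    return [max(x, y) for x, y in zip(a, b)]


def receive(local, j, received):
    """Replica j advances its own component, then merges the received clock."""
    return merge(increment(local, j), received)


def happened_before(a, b):
    """True if a < b: every component is <= and the clocks differ."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a, b):
    """Neither clock precedes the other: the writes conflict."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


a = increment([0, 0, 0], 0)   # write at replica 0 -> [1, 0, 0]
b = increment([0, 0, 0], 1)   # write at replica 1 -> [0, 1, 0]
print(concurrent(a, b))       # True: no causal order between the two writes
b = receive(b, 1, a)          # replica 1 receives a's clock -> [1, 2, 0]
print(happened_before(a, b))  # True: a is now in b's causal past
```

Concurrent clocks signal a conflict that timestamps alone cannot detect; how
the conflict is resolved is left to the system or to the client.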
Single Master Model. In this model each data partition has a master node and multiple
slave nodes. Updates are directed to the master node and then propagate
asynchronously to the slave nodes. With this model a system can
become unavailable if the master fails before any of the replicas has been
updated.
Multi-Master Model. In certain key ranges a high rate of update requests prevents
the Single Master Model from spreading the workload evenly. The Multi-Master
Model allows updates to be performed at any replica.
Quorum Based 2PC. Assume that there are N replicas of some data and a
coordinator node. When an update is requested, the coordinator sends the request to all
N replicas but has to wait for only W (W < N) successful answers. The same
happens for read actions: the coordinator sends the request to the N replicas but has to
wait for only R (R < N) successful responses, and from all the answers the
one with the highest timestamp is selected. This protocol is flexible because
different levels of consistency can be achieved by configuring W and R:
W+R > N gives strict consistency, since every read quorum then overlaps every write
quorum, while W+R ≤ N relaxes the model of consistency to a weaker one.
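
The quorum rule can be sketched with in-process objects standing in for the
replicas. The classes Replica and Coordinator, the integer timestamps and the
failure flag are assumptions made for illustration; networking, retries and the
commit phase are left out.

```python
class Replica:
    """In-memory replica; `up` simulates whether the node answers requests."""

    def __init__(self):
        self.up = True
        self.store = {}  # key -> (timestamp, value)

    def write(self, key, timestamp, value):
        if not self.up:
            return False
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)
        return True

    def read(self, key):
        if not self.up:
            return None
        return self.store.get(key, (0, None))


class Coordinator:
    def __init__(self, replicas, w, r):
        self.replicas, self.w, self.r = replicas, w, r

    def put(self, key, timestamp, value):
        """Send the write to all N replicas; succeed once W have acknowledged."""
        acks = sum(rep.write(key, timestamp, value) for rep in self.replicas)
        return acks >= self.w

    def get(self, key):
        """Use the first R answers and return the value with the highest timestamp."""
        answers = [a for a in (rep.read(key) for rep in self.replicas) if a is not None]
        if len(answers) < self.r:
            raise RuntimeError("read quorum not reached")
        return max(answers[:self.r], key=lambda tv: tv[0])[1]


replicas = [Replica() for _ in range(3)]        # N = 3
coord = Coordinator(replicas, w=2, r=2)         # W + R > N
coord.put("k", 1, "old")
replicas[2].up = False                          # replica 2 misses the next write
coord.put("k", 2, "new")
replicas[2].up, replicas[0].up = True, False    # a different replica fails
print(coord.get("k"))                           # the stale replica is outvoted
```

With W = R = 2 and N = 3 any two answering replicas include at least one that
acknowledged the latest write, which is why the read still returns the new value.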

Membership Management. Since nodes in a cluster may fail or recover, a technique is
needed that allows nodes to know about each other.
Omniscient Master. When nodes leave or join a cluster they communicate with a
master node that holds the authoritative view of the cluster. This method is simple and
provides a consistent view of cluster status, but there is still a single point of failure
and the model is not partition tolerant.
Gossip. This is a method to propagate cluster status to all the members. At a preset
interval each node selects another node and communicates its view of the cluster to it.
Every node maintains a timestamp for the information about itself and the rest of the
cluster. This method is scalable and failure tolerant but provides only eventual
consistency about cluster status.
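
A gossip exchange can be sketched as two nodes merging their timestamped views
of the cluster. The class GossipNode, the status strings and the two-way exchange
are assumptions made for illustration; real protocols add failure suspicion and
limit how much state is sent per round.

```python
import random


class GossipNode:
    def __init__(self, node_id):
        self.node_id = node_id
        # node id -> (timestamp of the information, status)
        self.view = {node_id: (0, "up")}

    def heartbeat(self, now):
        """Refresh the information this node holds about itself."""
        self.view[self.node_id] = (now, "up")

    def merge(self, remote_view):
        """Keep, for every node, the entry with the newest timestamp."""
        for node_id, (ts, status) in remote_view.items():
            if node_id not in self.view or ts > self.view[node_id][0]:
                self.view[node_id] = (ts, status)

    def gossip_with(self, peer):
        """Exchange views in both directions."""
        mine = dict(self.view)
        self.merge(peer.view)
        peer.merge(mine)


def gossip_round(nodes, now, rng=random):
    """Each node refreshes its heartbeat and gossips with one random peer."""
    for node in nodes:
        node.heartbeat(now)
        peer = rng.choice([n for n in nodes if n is not node])
        node.gossip_with(peer)


nodes = [GossipNode(name) for name in "abcde"]
for tick in range(1, 11):
    gossip_round(nodes, now=tick)
print(sorted(nodes[0].view))  # node ids that node "a" has heard about so far
```

No node holds an authoritative view; information spreads from peer to peer, so
views may differ temporarily before they converge.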


2.3 Open-Source Projects



Dynamo [4] and Bigtable [5] constituted a great starting point for developing
open-source, non-relational, distributed and horizontally scalable databases. The NoSQL
movement began in early 2009 and has grown rapidly into a substantial list of free and
competitive products providing most of the properties needed in distributed systems:
schema-free, replication support, easy API, eventual consistency, performance.
Below is a non-exhaustive list of current databases and their classification
along three important characteristics: scalability, data and query model, internal
persistence model.


                 Scalability                           Data and Query Model                Persistence Model

                 Add new machines    Support for       Data Model      Query API
                 transparently to    multiple
                 applications        datacenters

 Cassandra                                             Column family   Thrift              Memtable/SSTable
 HBase                                                 Column family   Thrift, REST        Memtable/SSTable on HDFS
 Riak                                                  Document        Nested hashes       ?
 Scalaris                                              Key/value       get/put             in-memory only
 Voldemort                           under development Key/value       get/put             BDB, MySQL
 CouchDB                                               Document        map/reduce views    append-only B-Tree
 MongoDB                                               Document        Cursor              B-Tree
 Neo4j                                                 Graph           Graph               on-disk linked lists
 Redis                                                 Collection      Collection          in-memory
 Tokyo Cabinet                                         Key/value       get/put             hash or B-Tree
 Chordless                                             Key/value       Java, simple RPC    ?
 InfoGrid                                              Graph           Java, http/REST     ?
 Sones                                                 Graph           .Net                ?


 Table 1. Classification by scalability, data and query model and persistence model [1], [13]

This table summarizes the most important characteristics of a subset of the
non-relational database systems currently available. The rest of this section
describes some of these databases.

   Cassandra. Development of this system started at Facebook, and one of its designers
was a co-author of Dynamo. The project is now open source and still under
“heavy development” at The Apache Software Foundation. Its authors define it as
a “structured storage system over a P2P network”. [11] The system combines the
distributed architecture of Dynamo with the column family model of Bigtable. From
the data model point of view Cassandra is a multi-dimensional map indexed by a
key, where each application creates its own key space. Besides column families, a new
concept of super columns is introduced, which represent lists of columns. Data is
sorted at write time, and within a row columns are sorted by name.
The partitioning subsystem is similar to Dynamo’s: consistent hashing is used.
The same concepts of coordinator node and preference list as in Dynamo are used for
data replication. Cluster management uses a variant of the Gossip technique, Scuttlebutt
anti-entropy Gossip. Internal persistence relies on the local file system, and the
storage structure is similar to the one in Bigtable: SSTable, memtable, commit logs,
compaction and Bloom filters. The system is written in Java and high level libraries
are available for Ruby, Perl, Python and Scala. Facebook, Digg and Rackspace use this
system in production. [11], [12]
   Voldemort. A key-value store developed by LinkedIn engineers that implements
most of the features available in Dynamo: partitioning and replication (consistent
hashing, preference list), object versioning (vector clocks), a pluggable storage
component (BDB, in-memory, MySQL). Voldemort also comes with a series of new
features: serialization, support for read-only nodes, compression. LinkedIn uses this
system as its underlying storage system. [11], [12]
   Riak. A key-value store that uses documents as values and follows the same
architecture and algorithms as Dynamo. It is implemented in Erlang, and various
client libraries are available: Jiak Client (Erlang (JSON)), Riak (Erlang (raw)),
Python, PHP, Ruby, Java, JavaScript. There are no known examples of usage in
production. [12]
   Redis. A key-value store where values can have multiple types: strings, lists, sets,
ordered sets. Replication is achieved via a Master – Slave model; client libraries
(available in PHP, Ruby, Scala) are responsible for partitioning. It uses a memory-driven
approach with asynchronous snapshots to disk for local persistence. Other
supported operations depend on the value’s data type: increments, decrements and
atomic multi-set (Strings); push, pop and range get (Lists); intersection, union and
difference (Sets); sorting. It is written in ANSI C and is used in production at Github,
Engine Yard and VideoWiki. [12]
   Neo4j. This is a disk-based (data is stored in a custom binary format), fully
transactional Java persistence engine that stores data structured in graphs. Some of its
most important features are: a graph-oriented model for data representation (it stores
nodes, relationships and properties), high scalability (both within a single machine
and across multiple machines), a simple object-oriented Java API, and optional layers
that expose it as an RDF store, express meta model semantics using OWL and query the
graph using SPARQL. [14]


3 NoSQL in the social and semantic Web context

The Semantic Web is an initiative of the World Wide Web Consortium (W3C) which
aims to transform the Web so that the data available today can be understood and
reused by machines. On a less abstract level this means attaching meta-data to the
resources on the Web and specifying relationships between these resources. The core
of the Semantic Web is a set of design principles and standards: standards already
widely used on the Web (XML, XML Schema), a formal language for expressing
data models (Resource Description Framework, RDF), a vocabulary for describing
properties of RDF-based models (Resource Description Framework Schema, RDFS),
a vocabulary for creating ontologies (Web Ontology Language, OWL), a data
query service (SPARQL) and other standards still under development (Rule Interchange
Format (RIF), Unifying Logic and Proof layers).
   Social Web is the term used to describe how people socialize and interact with each
other through the WWW. Classic examples of distributed web applications that
favored the development of large social networks are: Facebook, MySpace, LinkedIn,
Flickr, Twitter, Del.icio.us, etc.
   Regarding the influence of NoSQL on the Semantic Web, the vast list of database
systems developed, each exposing new techniques for managing data, contains some
examples that may address problems like managing RDF stores, managing ontologies or
creating SPARQL endpoints.
   Neo4j is probably the most obvious example of such a store. Its graph-oriented
data model makes it well suited to storing RDF triples or complex ontologies.
Although databases using this graph-oriented data view can manage
a much smaller volume of information than the other types of non-relational data
stores (key-value, column family, document), this volume is still large: billions
of nodes and relationships. Neo4j developers state that the traversal component of
this system is a high-performance one, and its more than 6 years of use in production
raise the degree of confidence in this system. [14]
   HBase (The Hadoop Database) is a scalable, distributed, column oriented, dynamic
schema database for structured data, modeled after Google Bigtable and under
development at ASF (Apache Software Foundation). The HBase data model can be
viewed as a multi-dimensional map where values are indexed by 4 keys (TableName,
RowKey, ColumnKey, Timestamp). Values are binary data, rows are sorted in
lexicographic order and columns are grouped in column families. The database
schema is flexible and can be modified at run-time. Such a dynamic schema allows
this system to store Semantic Web data. An example of such modeling can be found
in [17].
   Applications in the Social Web sphere have a longer history than Semantic Web
applications, so scalability, performance, availability and the handling of huge
volumes of data have become vital issues for them. Cassandra, one of the most
important non-relational distributed stores, is already used in production in large
social applications: Facebook, Digg. A comparison with MySQL on 50 GB of data shows
that Cassandra performs better. [11]

                              Read                           Write
MySQL                         ~350 ms                        ~300 ms
Cassandra                     0.12 ms                        15 ms



          Table 2. Performance comparison between MySQL and Cassandra on 50 GB of data



4 Conclusion

   RDMS have served large information systems for over 30 years, but the current
amount of data that needs to be managed causes multiple problems for these
systems. In order to address problems like scalability, performance and availability, a
new set of technologies and non-relational databases has been developed, and they
are collectively known under the term NoSQL.
   This paper presents the techniques and design practices that lie behind these new
database products, most of which are inspired by existing and reliable
systems like Amazon’s Dynamo and Google’s Bigtable. It also presents a few ideas on how
these systems already influence application development for the semantic and social
Web.
   The NoSQL trend began to grow rapidly in early 2009, and within a relatively short
period of time a large number of non-relational database solutions appeared, some of
which have already become components of various large scale applications. As future
research we plan to study in greater detail the techniques currently used in
designing such systems, and possibly to eliminate the vulnerabilities that may cause
some of them to fail in certain scenarios.

References

   1.    Ellis, Jonathan: NoSQL Ecosystem,
         http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/ (2009)
   2.    Shoup, Randy: eBay Marketplace Architecture: Architectural Strategies, Patterns, and
         Focuses (2007)
   3.    Bloor, Robin: 6 Reason Why Relational Database Will Be Superseded (2008)
   4.    DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A.,
         Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s Highly Available
         Key-value Store (2007)
   5.    Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A.,
         Burrows, M., Chandra, T., Fikes, A., Gruber, R. E.: Bigtable: A Distributed Storage
         System for Structured Data (2006)
   6.    Brewer, Eric A.: Towards Robust Distributed Systems, Principles Of Distributed
         Computing (2000)
   7.    Vogels, W.: Eventually Consistent,
         http://www.allthingsdistributed.com/2008/12/eventually_consistent.html (2008)
   8.    Ho, Ricky: Pragmatic Programming Techniques,
         http://horicky.blogspot.com/2009/11/nosql-patterns.html (2009)
   9.    Wiggins, Adam: SQL Databases Don’t Scale,
         http://adam.blog.heroku.com/past/2009/7/6/sql_databases_dont_scale/ (2009)
   10.   Browne, Julian: Brewer’s CAP Theorem,
         http://www.julianbrowne.com/article/viewer/brewers-cap-theorem (2009)
   11.   NOSQL debrief, http://blog.oskarsson.nu/2009/06/nosql-debrief.html (2009)
   12.   Gupta, Vineet: NoSQL Databases – Part 1- Landscape,
         http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html (2010)
   13.   NoSQL – Your Ultimate Guide to Non – Relational Universe,
         http://nosql-databases.org/
   14.   Neo4j – the graph database, http://neo4j.org/
   15.   Semantic Web, http://en.wikipedia.org/wiki/Semantic_Web
   16.   Social Web, http://en.wikipedia.org/wiki/Social_web
   17.   Mateescu, Gabriel: Finding the way through the semantic Web with HBase,
         http://www.ibm.com/developerworks/opensource/library/os-hbase/index.html?ca=dgr-twtrHBasedth-OS&S_TACT=105AGY83&S_CMP=TWDW (2009)

More Related Content

What's hot

Khushali Patel-resume-
Khushali Patel-resume-Khushali Patel-resume-
Khushali Patel-resume-Khushali11
 
SenaritraMSBI_Resume
SenaritraMSBI_ResumeSenaritraMSBI_Resume
SenaritraMSBI_ResumeSenaritra Das
 
Resume_Grace Li
Resume_Grace LiResume_Grace Li
Resume_Grace LiAngie Li
 
Reach End Users With Next Generation Web Applications
Reach End Users With Next Generation Web ApplicationsReach End Users With Next Generation Web Applications
Reach End Users With Next Generation Web ApplicationsJeff Blankenburg
 
Multi Tier Architecture
Multi Tier ArchitectureMulti Tier Architecture
Multi Tier Architecturegatigno
 
Satyajeet_Parida-SQL_SERVER_DBA
Satyajeet_Parida-SQL_SERVER_DBASatyajeet_Parida-SQL_SERVER_DBA
Satyajeet_Parida-SQL_SERVER_DBASatyajeet Parida
 
Professional Portfolio
Professional PortfolioProfessional Portfolio
Professional PortfolioMoniqueO Opris
 
Metaaso J Webframework
Metaaso J WebframeworkMetaaso J Webframework
Metaaso J Webframeworkjwebframework
 
Mobile Responsive Social Corporate Intranet Portal Application
Mobile Responsive Social Corporate Intranet Portal ApplicationMobile Responsive Social Corporate Intranet Portal Application
Mobile Responsive Social Corporate Intranet Portal ApplicationMike Taylor
 
Kiran_Patil_JavaDeveloper
Kiran_Patil_JavaDeveloperKiran_Patil_JavaDeveloper
Kiran_Patil_JavaDeveloperKIRAN PATIL
 
Mahesh_Resume
Mahesh_ResumeMahesh_Resume
Mahesh_ResumeMahesh B
 
Rakesh sr dwh_bi_consultant resume
Rakesh sr dwh_bi_consultant resumeRakesh sr dwh_bi_consultant resume
Rakesh sr dwh_bi_consultant resumeRakesh Kumar
 
SQLSaturday#290_Kiev_AdHocMaintenancePlansForBeginners
SQLSaturday#290_Kiev_AdHocMaintenancePlansForBeginnersSQLSaturday#290_Kiev_AdHocMaintenancePlansForBeginners
SQLSaturday#290_Kiev_AdHocMaintenancePlansForBeginnersTobias Koprowski
 

What's hot (20)

PRATIK MUNDRA
PRATIK MUNDRAPRATIK MUNDRA
PRATIK MUNDRA
 
Khushali Patel-resume-
Khushali Patel-resume-Khushali Patel-resume-
Khushali Patel-resume-
 
Toad
ToadToad
Toad
 
SenaritraMSBI_Resume
SenaritraMSBI_ResumeSenaritraMSBI_Resume
SenaritraMSBI_Resume
 
Resume Vikram_S
Resume Vikram_SResume Vikram_S
Resume Vikram_S
 
Resume_Grace Li
Resume_Grace LiResume_Grace Li
Resume_Grace Li
 
Reach End Users With Next Generation Web Applications
Reach End Users With Next Generation Web ApplicationsReach End Users With Next Generation Web Applications
Reach End Users With Next Generation Web Applications
 
MarkAndrews
MarkAndrewsMarkAndrews
MarkAndrews
 
Satheesh Oracle DBA Resume
Satheesh Oracle DBA ResumeSatheesh Oracle DBA Resume
Satheesh Oracle DBA Resume
 
Multi Tier Architecture
Multi Tier ArchitectureMulti Tier Architecture
Multi Tier Architecture
 
vikram ch resume
vikram ch resumevikram ch resume
vikram ch resume
 
Satyajeet_Parida-SQL_SERVER_DBA
Satyajeet_Parida-SQL_SERVER_DBASatyajeet_Parida-SQL_SERVER_DBA
Satyajeet_Parida-SQL_SERVER_DBA
 
Professional Portfolio
Professional PortfolioProfessional Portfolio
Professional Portfolio
 
Metaaso J Webframework
Metaaso J WebframeworkMetaaso J Webframework
Metaaso J Webframework
 
Mobile Responsive Social Corporate Intranet Portal Application
Mobile Responsive Social Corporate Intranet Portal ApplicationMobile Responsive Social Corporate Intranet Portal Application
Mobile Responsive Social Corporate Intranet Portal Application
 
Kiran_Patil_JavaDeveloper
Kiran_Patil_JavaDeveloperKiran_Patil_JavaDeveloper
Kiran_Patil_JavaDeveloper
 
CustomerCopy
CustomerCopyCustomerCopy
CustomerCopy
 
Mahesh_Resume
Mahesh_ResumeMahesh_Resume
Mahesh_Resume
 
Rakesh sr dwh_bi_consultant resume
Rakesh sr dwh_bi_consultant resumeRakesh sr dwh_bi_consultant resume
Rakesh sr dwh_bi_consultant resume
 
SQLSaturday#290_Kiev_AdHocMaintenancePlansForBeginners
SQLSaturday#290_Kiev_AdHocMaintenancePlansForBeginnersSQLSaturday#290_Kiev_AdHocMaintenancePlansForBeginners
SQLSaturday#290_Kiev_AdHocMaintenancePlansForBeginners
 

NoSQL On Social And Sematic Web

  • 1. NoSQL initiative and its influences on social and semantic Web

Stefan Prutianu, Stefan Ceriu
Faculty of Computer Science, „Al. I. Cuza“ University, Iasi, Romania
{stefan.prutianu, stefan.ceriu}@info.uaic.ro

Abstract. In this paper we describe NoSQL, a family of non-relational database technologies and products developed to address the problems that RDBMS systems currently face: lack of true scalability, poor performance on high data volumes and low availability. Some of these products are already used in production and perform very well: Amazon’s Dynamo, Google’s Bigtable, Cassandra, etc. We also provide a view on how these systems influence application development in the social and semantic Web sphere.

Keywords: NoSQL, distributed computing, distributed non-relational database, semantic Web, social Web, scalability

1 Introduction

Modern relational database technologies tend to have serious problems when managing the huge volumes of data found today (eBay: 2 PB of data overall [2]). These problems are scalability, performance and rigid schema design. [1]

Vertical scaling (increasing the computational power of a single node) is just a temporary solution that lasts only until the data again grows beyond the storage limit.

Horizontal scaling in a traditional relational database management system (partitioning, sharding) means dividing the data into multiple databases according to some application-specific boundaries. Splitting the data across multiple servers, however, breaks the relationships stored within the database, which are the most valuable property of a relational database, and it is also not transparent to the application’s business logic.
Read slaves are a form of horizontal scaling used in RDBMS (Relational Database Management Systems) in which read-only slave databases replicate the master database: every write is directed to the master and every read to one of the read slave replicas. This is still not true scaling, because the master remains a single point of failure.

Large relational databases (multiple terabytes or petabytes in size) usually perform slowly on complex queries because of the amount of data they have to scan and because their design is disk-oriented, and disk operations are time consuming. [3]

An RDBMS requires that the database schema (tables, columns, relationships) be designed before the data is used, and in most cases such a schema will require changes
  • 2. (adding new features, adjusting or fine-tuning existing ones), but changing the database schema is very hard in such systems (updating rows may lock them, and it is a very time-consuming operation). [12]

NoSQL is the common name for a set of new technologies, design practices and open-source projects that address the problems large-scale distributed applications and platforms face: scalability, availability, performance and fault tolerance. The NoSQL trend is not intended to replace the relational database model; instead it proposes new solutions to problems that the traditional database model cannot solve.

This paper is structured as follows. Section 2 describes the NoSQL trend in detail, with its proposed solutions and results; Section 3 presents how NoSQL has influenced application development in the Social and Semantic Web sphere; and Section 4 concludes our survey.

2 NoSQL

2.1 Overview

NoSQL proponents became more visible in early 2009, when they proposed distributed database solutions for systems in which the relational features of an RDBMS are not needed. Their inspiration came from the closed-source distributed databases already in use at some large corporations, such as Dynamo at Amazon and Bigtable at Google. These solutions, along with the open-source projects (Cassandra, Hypertable, HBase, Redis), share a number of characteristics: key-value storage, operation on a large number of machines, and data that is partitioned and shared among these machines.
Another common characteristic is that, in order to reach the desired level of scalability, availability, performance and fault (partition) tolerance, the data consistency requirement is relaxed. The reason is Eric Brewer’s CAP Theorem, which states that in a distributed environment you cannot get Consistency, Availability and Partition Tolerance at the same time [6], so most of these systems provide a particular form of weak consistency named eventual consistency.

Consistency means that a system operates fully or not at all; in a distributed environment, if an update is made on some node, all its replicas are updated before any read from those replicas is performed. Consistency can be achieved with relational databases because they focus on the ACID (Atomicity, Consistency, Isolation, Durability) properties.

Availability means that a system is always available to perform the requested tasks.

Partition Tolerance is the ability of a distributed system to keep working when a partition forms, that is, when one or more nodes are isolated from the others because of network or communication failures.
  • 3. Eventual consistency is a specific form of weak consistency: if no new updates are made to an object, eventually all reads will return the updated value. [7] DNS (Domain Name System) is an example of a system that implements eventual consistency.

Dynamo. Amazon’s Dynamo is a highly available key-value structured storage system [4]. It was developed to meet Amazon’s needs for reliability and scaling. Access to data is provided through a primary-key interface (get(key), put(key) and overloads of these operations). Scalability and availability are achieved through a combination of techniques: consistent hashing is used for data partitioning and replication, data consistency is facilitated by object versioning, consistency among replicas during updates is maintained by a quorum technique and a decentralized replica synchronization protocol, and a gossip-based protocol is used for failure detection and membership updates.

Amazon’s engineers motivated the design by several observations: most of the services their platform exposes store and retrieve data by primary key and therefore do not require the complex querying and management functionality of an RDBMS; an RDBMS is costly to maintain; and traditional storage models would sacrifice availability in favor of consistency.

The Dynamo components for request coordination, membership and failure detection, and the local persistence engine are all implemented in Java. The local persistence component has a pluggable design and uses engines such as BDB (Berkeley Database) Transactional Data Store, BDB Java Edition, MySQL and an in-memory buffer with persistent backing storage. [4]

Bigtable. Google’s Bigtable is a distributed storage system for managing structured data, designed to be highly scalable. This system has proven its efficiency in important Google applications: Personalized Search, Google Analytics, Google Earth, Google Finance.
Bigtable does not support a full relational data model; instead it provides clients with a simple data model indexed by row, column and timestamp. From the data model point of view, Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map in which each value is an uninterpreted array of bytes.

Row keys are arbitrary strings, and data in Bigtable is maintained in lexicographic order by row key. Every read or write under a single row key is an atomic operation, regardless of the number of columns involved. Columns are grouped into sets called column families, which usually contain information of the same type. Timestamps are needed because each cell can contain multiple versions of the same data.

The Bigtable API provides functions for creating and deleting tables and column families, for reads and updates under a particular key, and for other operations involving cluster management. A master model is used to manage load balancing and fault tolerance. For internal persistence Bigtable uses the SSTable file format (an immutable sorted file of key-value pairs) in conjunction with GFS (Google File System). [5]
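The data model described above can be pictured with a small in-memory sketch. This is only an illustration of a sparse, sorted map indexed by row, column and timestamp; the class name `SparseTable`, its methods, and the "family:qualifier" column strings are assumptions made for this example and are not Bigtable's actual API or storage layout.

```python
import bisect


class SparseTable:
    """Toy model of a sorted map: (row, column, timestamp) -> bytes."""

    def __init__(self):
        self._rows = {}      # row key -> {column -> [(timestamp, value), ...]}
        self._row_keys = []  # row keys kept in lexicographic order

    def put(self, row, column, timestamp, value):
        """Store one version of a cell; older versions are kept."""
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        versions = self._rows[row].setdefault(column, [])
        versions.append((timestamp, value))
        # Newest version first, so reads find the latest value quickly.
        versions.sort(key=lambda version: version[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Return the newest value, or the newest one not after `timestamp`."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def scan(self, start_row, end_row):
        """Return row keys in [start_row, end_row) in lexicographic order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, end_row)
        return self._row_keys[lo:hi]


table = SparseTable()
table.put("com.example/b", "contents:html", 1, b"<html>v1</html>")
table.put("com.example/b", "contents:html", 2, b"<html>v2</html>")
table.put("com.example/a", "anchor:home", 1, b"Home")
latest = table.get("com.example/b", "contents:html")
older = table.get("com.example/b", "contents:html", timestamp=1)
rows = table.scan("com.example/", "com.example/z")
```

Because rows are sparse, a row only stores the columns that were actually written, and keeping row keys sorted is what makes range scans over neighbouring keys cheap.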
  • 4. 2.2 Design patterns [8], [11]

API Model. Because the underlying data model can be considered a large distributed hash table (DHT), the basic API (Application Programming Interface) could be:
- get(key): extracts the value stored at the given key.
- put(key, value): updates the value stored at the given key.
- delete(key): removes the key and its associated value.

Machine Infrastructure. The infrastructure for this kind of system is composed of a large number of machines with commodity hardware connected through a network. Each machine (physical node) has the same software configuration, but the hardware characteristics may differ. Within each physical node a number of virtual nodes are running.

Partition Schemes. Most large-scale distributed systems use a consistent hashing technique because of its flexibility when the number of virtual nodes changes. When nodes are added or removed, keys and data need to be redistributed, and consistent hashing minimizes the amount of data that moves. In consistent hashing the key space is finite: the output range of a hash function is treated as a fixed ring. Both virtual node ids and data item keys take values in this circular space, and the owner of a key is the first virtual node encountered when walking the ring clockwise from that key. If a virtual node crashes, all the keys it owned are adopted by its clockwise neighbor, so the rest of the virtual nodes on the ring are not affected.

Data Replication. In order to achieve high availability and performance, the same data needs to be available on multiple nodes, called replicas. In Dynamo [4] the list of nodes responsible for storing a particular key is called the preference list, and the size of this list is configured by a preset parameter.
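The ring lookup and the preference list described above can be sketched in a few lines. This is a minimal illustration, assuming MD5 as the hash function and one ring position per virtual node; the names `HashRing`, `ring_position` and `preference_list` are hypothetical and do not correspond to the code of Dynamo or any other system.

```python
import bisect
import hashlib


def ring_position(name):
    """Map a string onto the ring (MD5 is an arbitrary choice for this sketch)."""
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self):
        self._positions = []  # sorted ring positions of the virtual nodes
        self._owner = {}      # ring position -> virtual node id

    def add_node(self, node_id):
        pos = ring_position(node_id)
        bisect.insort(self._positions, pos)
        self._owner[pos] = node_id

    def remove_node(self, node_id):
        pos = ring_position(node_id)
        self._positions.remove(pos)
        del self._owner[pos]

    def preference_list(self, key, n=1):
        """First n virtual nodes met when walking clockwise from the key."""
        if not self._positions:
            return []
        start = bisect.bisect_left(self._positions, ring_position(key))
        count = min(n, len(self._positions))
        size = len(self._positions)
        return [self._owner[self._positions[(start + i) % size]]
                for i in range(count)]

    def owner(self, key):
        return self.preference_list(key, 1)[0]


ring = HashRing()
for node in ["node-a", "node-b", "node-c"]:
    ring.add_node(node)

keys = ["user:%d" % i for i in range(200)]
before = {key: ring.preference_list(key, 2) for key in keys}

# Simulate a crash: only keys owned by node-b should change owner,
# and they should move to the next node clockwise.
ring.remove_node("node-b")
after = {key: ring.owner(key) for key in keys}
```

Comparing `before` and `after` shows the property the text relies on: keys that were not owned by the removed node keep their owner, and the removed node's keys fall to what was the second entry of their preference list.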
While read actions can be performed on any replica, update actions can lead to consistency issues because the updates need to be propagated to all the replicas.

Data Models. The basic data access method is to use a key in order to retrieve or update a value. The value can be a blob (binary large object) [4], a document, a column family (rows and columns, where a row can have as few or as many columns as desired) [5], a graph or a collection.

Storage Models. The most common strategy is to design this component in a pluggable fashion, where the storage mechanism can be MySQL, Berkeley DB, the file system (SSTables) or in-memory storage (memtables).

Consistency Management. The same data is available on multiple nodes at a given time, and the problem that arises is how to synchronize these replicas in order to preserve a consistent view of the data from the client’s perspective. In systems where availability and partition tolerance are important requirements, strict consistency cannot be achieved together with these two properties (CAP Theorem), thus a form of
  • 5. weak consistency, eventual consistency, is implemented in these systems. There are various mechanisms that guarantee such systems will eventually become consistent after a period of time (the inconsistency window) during which synchronization is performed.

Timestamps. The history of operations performed on a row of data can be used to decide which value the row will eventually converge to. The drawbacks of this method are that it requires synchronized clocks on all nodes, it does not capture causality, and a decision is hard to make when write operations happen simultaneously.

Vector clocks. A vector clock is a tuple {t1, t2, …, tn} of clock values, one from each node. When a write operation is performed on node i, the node sets ti to its clock value. Given two vector clocks v1 and v2, v1 < v2 (that is, v1[k] ≤ v2[k] for all k, with v1 ≠ v2) implies that the event stamped v1 precedes the event stamped v2 in the global ordering of events. Replicas follow certain rules when updating their vector clocks:
- when an internal operation happens at replica i, it advances its own entry vi[i];
- when replica i sends a message to replica j, it attaches its vector clock to the message;
- when replica j receives a message from replica i, it advances its own entry vj[j] and then merges its clock with the one received in the message: vj[k] = max(vj[k], vi[k]) for every k.

Single Master Model. In this model each data partition has a master node and multiple slave nodes. Updates are directed to the master node and then propagate asynchronously to the slave nodes. With this model data can become unavailable if the master fails before any of the replicas has received the latest updates.

Multi-Master Model. Intensive update requests in certain key ranges prevent the Single Master Model from spreading the workload evenly. The Multi-Master Model allows updates to be performed at any replica.

Quorum Based 2PC. 
Assume that there are N replicas of some data and a coordinator node. When an update is requested, the coordinator sends the request to all N replicas but has to wait for only W (W < N) successful answers. The same happens for read actions: the coordinator sends the request to the N replicas but has to wait for only R (R < N) successful responses, and from the answering nodes the value with the highest timestamp is selected. This protocol is flexible because different levels of consistency can be achieved by configuring W and R: with W + R > N consistency is strict, since every read quorum overlaps every write quorum in at least one replica; with W + R ≤ N the consistency model is relaxed to a weaker one.

Membership Management. Since nodes in a cluster may fail or recover, a technique is needed that allows nodes to know about each other.

Omniscient Master. When nodes leave or join a cluster they communicate with a master node that holds the authoritative view of the cluster. This method is simple and provides a consistent view of the cluster status, but there is still a single point of failure and the model is not partition tolerant.

Gossip. This is a method for propagating the cluster status to all members. At a preset interval, each node selects another node and communicates its view of the cluster to it. Every node maintains a timestamp of the information about itself and the rest of the
  • 6. cluster. This method is scalable and failure tolerant but provides only eventual consistency of the cluster status.

2.3 Open-Source Projects

Dynamo [4] and Bigtable [5] were a great starting point for developing open-source, non-relational, distributed and horizontally scalable databases. The NoSQL movement began in early 2009 and has grown rapidly into a substantial list of free and competitive products that provide most of the properties needed in distributed systems: schema-free design, replication support, an easy API, eventual consistency and performance.

Below is a non-exhaustive list of current databases, classified along three important characteristics: scalability, data and query model, and internal persistence model. The Scalability group covers two properties: adding new machines transparently to applications, and support for multiple datacenters.

Database        Scalability          Data model      Query API               Persistence model
Cassandra                            Column family   Thrift                  Memtable/SSTable
HBase                                Column family   Thrift, REST            Memtable/SSTable on HDFS
Riak                                 Document        Nested hashes           ?
Scalaris                             Key/value       get/put                 in-memory only
Voldemort       under development    Key/value       get/put                 BDB, MySQL
CouchDB                              Document        map/reduce views        append-only B-Tree
MongoDB                              Document        Cursor                  B-Tree
Neo4j                                Graph           Graph                   on-disk linked lists
Redis                                Collection      Collection              in-memory
Tokyo Cabinet                        Key/value       get/put                 hash or B-Tree
Chordless                            Key/value       Java, simple RPC        ?
InfoGrid                             Graph           Java, http/REST         ?
Sones                                Graph           .Net                    ?

  • 7. Table 1. Classification by scalability, data and query model and persistence model [1], [13]

This table summarizes the most important characteristics of a subset of the non-relational database systems currently available. The rest of this section describes some of these databases.

Cassandra. Development of this system started at Facebook, and one of its designers was a co-author of Dynamo. At the moment the project is open source and still under “heavy development” at The Apache Software Foundation. Its authors define it as a “structured storage system over a P2P network”. [11]

The system combines the distributed architecture of Dynamo with the column family model of Bigtable. From the data model point of view, Cassandra is a multi-dimensional map indexed by a key, where each application creates its own key space. Besides column families, a new concept of super columns is introduced, which represents lists of columns. Data is sorted at write time, and within a row the columns are sorted by name. The partitioning subsystem is similar to Dynamo’s approach: consistent hashing is used. The same concepts of coordinator node and preference list as in Dynamo are used for data replication. Cluster management uses a variant of the Gossip technique, Scuttlebutt anti-entropy Gossip. Internal persistence relies on the local file system, and the storage structure is similar to the one in Bigtable: SSTables, memtables, commit logs, compaction and Bloom filters. The system is written in Java, and high-level libraries are available for Ruby, Perl, Python and Scala. Facebook, Digg and Rackspace use this system in production. [11], [12]

Voldemort. 
Key-value store system developed by LinkedIn engineers that implements most of the features available in Dynamo: partitioning and replication (consistent hashing, preference list), object versioning (vector clocks) and a pluggable storage component (BDB, in-memory, MySQL). Voldemort also comes with a series of new features: serialization, support for read-only nodes and compression. LinkedIn uses this system as its underlying storage system. [11], [12]

   Riak. Key-value store system that uses documents as values and relies on the same architecture and algorithms as Dynamo. It is implemented in Erlang, and various client libraries are available: Jiak Client (Erlang (JSON)), Riak (Erlang (raw)), Python, PHP, Ruby, Java, JavaScript. There are no known examples of use in production. [12]

   Redis. Key-value store where values can have multiple types: strings, lists, sets, ordered sets. Replication follows a master-slave model, and client libraries (available in PHP, Ruby, Scala) are responsible for partitioning. It uses a memory-driven approach with asynchronous snapshots to disk for local persistence. Other supported operations depend on the data type of the value: increments, decrements and atomic multi-set (strings); push, pop and range get (lists); intersection, union and difference (sets); and sorting. It is written in ANSI C and is used in production at GitHub, Engine Yard and VideoWiki. [12]

   Neo4j. This is a disk-based (data is stored in a custom binary format), fully transactional Java persistence engine that stores data structured as graphs. Some of its most important features are: a graph-oriented model for data representation (it stores nodes, relationships and properties), high scalability (both on a single machine and across multiple machines), a simple object-oriented Java API, and optional layers that expose it as an RDF store, express meta model semantics using OWL and query the graph using SPARQL. [14]

3 NoSQL in the social and semantic Web context

   The Semantic Web is an initiative of the World Wide Web Consortium (W3C) which involves transforming the Web so that the data available today can be understood and reused by machines. On a less abstract level this means attaching metadata to the resources on the Web and specifying relationships between these resources.
   The core of the Semantic Web is a set of design principles and standards: those already widely used on the Web (XML, XML Schema); a formal language for expressing data models, the Resource Description Framework (RDF); a vocabulary for describing properties of RDF-based models, Resource Description Framework Schema (RDFS); a vocabulary for creating ontologies, the Web Ontology Language (OWL); a data query service, SPARQL; and other standards still under development, such as the Rule Interchange Format (RIF) and the Unifying Logic and Proof layers.
   Social Web is the term used to describe how people socialize and interact with each other throughout the WWW.
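The Semantic Web description above (metadata attached to resources, plus relationships between those resources) can be made concrete with a small sketch. RDF expresses each statement as a (subject, predicate, object) triple, and a query is essentially a pattern over such triples. The Python snippet below is only an illustration under stated assumptions: the `ex:` resources and the data are invented for the example, and the `match` and `friends_of` helpers are hypothetical stand-ins for what a real RDF store and a SPARQL engine would provide.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# An RDF statement: (subject, predicate, object).
Triple = Tuple[str, str, str]

# Hypothetical example data; the "ex:" resources are invented for illustration.
TRIPLES: List[Triple] = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
    ("ex:alice", "foaf:name", "Alice"),
]


def match(
    triples: Iterable[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for subj, pred, obj in triples:
        if s is not None and subj != s:
            continue
        if p is not None and pred != p:
            continue
        if o is not None and obj != o:
            continue
        yield (subj, pred, obj)


def friends_of(triples: Iterable[Triple], person: str) -> List[str]:
    """Return, sorted, the resources that `person` is related to via foaf:knows."""
    return sorted(obj for _, _, obj in match(triples, s=person, p="foaf:knows"))
```

Calling `friends_of(TRIPLES, "ex:alice")` follows the single `foaf:knows` relationship starting at `ex:alice`. A linear scan like this only conveys the shape of the data; a production store would index the triples rather than scan them.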
Classic examples of distributed web applications that favored the development of large social networks are: Facebook, MySpace, LinkedIn, Flickr, Twitter, Del.icio.us, etc.
   Regarding the influence of NoSQL on the Semantic Web, the long list of database systems that have been developed, each exposing new techniques for managing data, contains some that may address problems such as managing RDF stores, managing ontologies or creating SPARQL endpoints.
   Neo4j is probably the most obvious example of such a store system. Its graph-oriented data model makes it well suited to storing RDF triples or complex ontologies. Although databases using this graph-oriented view manage a much smaller volume of information than the other types of non-relational data stores (key-value, column family, document), this volume is still large: billions of nodes and relationships. Neo4j developers state that the traversal component of the system is high-performance, and its more than 6 years of use in production raise the degree of confidence in this system. [14]
   HBase (The Hadoop Database) is a scalable, distributed, column-oriented, dynamic-schema database for structured data, modeled after Google Bigtable and under development at the ASF (Apache Software Foundation). The HBase data model can be viewed as a multi-dimensional map where values are indexed by 4 keys (TableName, RowKey, ColumnKey, Timestamp). Values are binary data, rows are sorted in lexicographic order and columns are grouped in column families. The database schema is flexible and can be modified at run-time. Such a dynamic schema allows this system to store Semantic Web data. An example of such modeling can be found in [17].
   Applications in the Social Web sphere have a longer history than Semantic Web applications, so scalability, performance, availability and the handling of huge volumes of data became vital issues for them. Cassandra, one of the most important non-relational distributed stores, is already used in production in large social applications: Facebook, Digg. A comparison with MySQL on 50 GB of data shows that Cassandra performs better. [11]

              Read        Write
MySQL         ~350 ms     ~300 ms
Cassandra     15 ms       0.12 ms

Table. 2. Performance comparison between MySQL and Cassandra on 50 GB of data

4 Conclusion

   RDMS have served large informational systems for over 30 years, but the current amount of data that needs to be managed causes multiple problems for these systems. In order to address problems such as scalability, performance and availability, a new set of technologies and non-relational databases has been developed, collectively known under the term NoSQL.
   This paper presents the techniques and design practices that underlie these new database products, most of which are inspired by existing and reliable systems like Amazon’s Dynamo and Google’s Bigtable. A few ideas on how these systems already influence application development for the semantic and social Web are also presented.
   The NoSQL trend began to grow rapidly in early 2009, and within a relatively short period of time a large number of non-relational database solutions appeared, some of which have already become components of various large-scale applications.
   As future research, we plan to study in greater detail the current techniques used in designing such systems and, possibly, to eliminate the vulnerabilities that may cause some of them to fail in certain scenarios.
References

1. Ellis, Jonathan: NoSQL Ecosystem, http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/ (2009)
2. Shoup, Randy: eBay Marketplace Architecture: Architectural Strategies, Patterns, and Focuses (2007)
3. Bloor, Robin: 6 Reasons Why Relational Database Will Be Superseded (2008)
4. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P., Vogels, W.: Dynamo: Amazon’s Highly Available Key-value Store (2007)
5. Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., Gruber, R. E.: Bigtable: A Distributed Storage System for Structured Data (2006)
6. Brewer, Eric A.: Towards Robust Distributed Systems, Principles Of Distributed Computing (2000)
7. Vogels, W.: Eventually Consistent, http://www.allthingsdistributed.com/2008/12/eventually_consistent.html (2008)
8. Ho, Ricky: Pragmatic Programming Techniques, http://horicky.blogspot.com/2009/11/nosql-patterns.html (2009)
9. Wiggins, Adam: SQL Databases Don’t Scale, http://adam.blog.heroku.com/past/2009/7/6/sql_databases_dont_scale/ (2009)
10. Browne, Julian: Brewer’s CAP Theorem, http://www.julianbrowne.com/article/viewer/brewers-cap-theorem (2009)
11. NOSQL debrief, http://blog.oskarsson.nu/2009/06/nosql-debrief.html (2009)
12. Gupta, Vineet: NoSQL Databases – Part 1 – Landscape, http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html (2010)
13. NoSQL – Your Ultimate Guide to Non-Relational Universe, http://nosql-databases.org/
14. Neo4j – the graph database, http://neo4j.org/
15. Semantic Web, http://en.wikipedia.org/wiki/Semantic_Web
16. Social Web, http://en.wikipedia.org/wiki/Social_web
17. Mateescu, Gabriel: Finding the way through the semantic Web with HBase, http://www.ibm.com/developerworks/opensource/library/os-hbase/index.html?ca=dgr-twtrHBasedth-OS&S_TACT=105AGY83&S_CMP=TWDW (2009)