This document provides an introduction to NoSQL and data scalability techniques. It discusses key concepts like data scalability and horizontal scaling. The two main architectures for scalable data systems are data grids and NoSQL systems. Example NoSQL implementations discussed include MongoDB, which uses a document-oriented data model, and GigaSpaces XAP, a data grid platform. The document provides examples of using MongoDB with Python and discusses some advantages and disadvantages of MongoDB compared to SQL databases.
1. #105
Get More Refcardz! Visit refcardz.com
Getting Started with
CONTENTS INCLUDE:
n
Introduction
n
Scalable Data Architecture
NoSQL and Data Scalability
n
Is NoSQL For You?
n
mongoDB
n
GigaSpaces XAP
Google App Engine Datastore and more...
By Eugene Ciurana
n
Implementations of either often share characteristics from the
INTRODUCTION other.
The DZone Refcard #43 is an introduction to system high Data Grids
availability and scalability terminology and techniques Data grids process workloads defined as independent jobs
(http://refcardz.dzone.com/refcardz/scalability). The next that don’t require data sharing among processes. Storage
logical step is the scalable handling of massive data volumes or network may be shared across all nodes of the grid, but
resulting from having these powerful processing capabilities. intermediate results have no bearing on other jobs progress or
on other nodes in the grid, such as a MapReduce cluster.
This Refcard demystifies NoSQL and data scalability
techniques by introducing some core concepts. It also offers
an overview of current technologies available in this domain
Consumer
and suggests how to apply them.
What is Data Scalability?
Master
Data scalability is the ability of a system to store, manipulate,
analyze, and otherwise process ever increasing amounts of
www.dzone.com
data without reducing overall system availability, performance, Load Balancer
or throughput.
Data scalability is achieved by a combination of more powerful
Node Node Node Node
processing capabilities and larger but efficient storage
mechanisms.
Relational and hierarchical databases scale up by adding more Load Balancer
processors, more storage, caching systems, and such. Soon
they hit either a cost or a practical scalability limit because
they are difficult or impossible to scale out. These database Node Node Node Node
management systems are designed as single units that must
maintain data integrity, and enforce schema rules to guarantee Figure 1 - Data Grid
it. This rigidity is what promotes upward, but not outward,
Getting Started with NoSQL and Data Scalability
scalability. Data grids expose their functionality through a single API
(either a Web service or native to the application programming
Oracle RAC is a cluster of multiple computers with language) that abstracts its topology and implementation from
access to a common database. This is considered its data processing consumer.
Hot only vertically scalable because the processing
Tip (usually in the form of stored procedures) may scale
out, but the shared storage facilities don’t scale with
the cluster.
Get over 90 DZone Refcardz
Data integrity and schemas are suited for handling
transactional, normalized, uniform data. They handle FREE from Refcardz.com!
unstructured or rapidly evolving data structures with difficulty
or exponentially larger costs.
Hot Data replication is not the same as data scalability!
Tip
SCALABLE DATA ARCHITECTURES
There are two general kinds of architectures used for building
scalable data systems: data grids and NoSQL systems.
DZone, Inc. | www.dzone.com
2. 2
Getting Started with NoSQL and Data Scalability
Areas of Application NoSQL vs RDBMS vs OO Analogies
• Financial modeling NoSQL RDBMS OO
• Data mining
Kind Table Class
• Click stream analytics
Entity Record Object
• Document clustering
• Distributed sorting or grepping Attribute Column Property
• Simulations
• Inverted index construction
IS NoSQL FOR YOU?
• Protein folding
NoSQL Preparation:
NoSQL describes a horizontally scalable, non-relational
• Don’t fall prey to the NoSQL fad
database with built-in replication support. Applications
• Don’t be stubborn; neither NoSQL nor traditional
interact with it through a simple API, and the data is stored in a
databases apply to all cases
“schema-free”, flat addressing repository, usually as large files
• Apply the CAP Theorem to your use cases to
or data blocks. The repository is often a custom file system
determine feasibility
designed to support NoSQL operations.
Brewer’s (CAP) Theorem
NoSQL is intended as shorthand for “not only It’s impossible for a distributed computer system to
Hot SQL.” Complete architectures almost always mix simultaneously provide all three of these guarantees:
Tip
traditional and NoSQL databases. • Consistency (all nodes see the same data at the same time)
• Availability (node failures don’t prevent survivors from
NoSQL setups are best suited for non-OLTP applications that continuing to operate)
process massive amounts of structured and unstructured data • Partition tolerance (no failures less than total network
at a lower cost and with higher efficiency than RDBMs and failures cause the system to fail)
stored procedures.
Since only two of these characteristics are guaranteed for any
given scalable system, use your functional specification and
Consumer business SLA (service level agreement) to determine what your
minimum and target goals for CAP are, pick the two that meet
your requirements, and proceed to implement the appropriate
Node Node Node Node technology.
Rule of Thumb: NoSQL’s primary goal is to achieve
Virtual File System
logical table management, load balancing, garbage collection
Hot horizontal scalability. It attains this by reducing
(HDFS, mongoFS, Hypertable) Tip
transactional semantics and referential integrity.
Tablet Tablet Tablet
Server 0 Server 1 Server n
Use Figure 3 to identify the best match between your
Distributed File System
application’s CAP requirements and the suggested SQL and
NoSQL systems listed.
FS 0 FS 1 FS 2 FS n
Figure 2 - NoSQL Topology Relational
Key-Value
Column-Oriented
NoSQL storage is highly replicated (a commit doesn’t occur Document-Oriented
until the data is successfully written to at least two separate
Consistency Availability
storage devices) and the file systems are optimized for write-
C A
RDBMs (Oracle, MySQL), Aster Data, Green Plum, Vertica
only commits. The storage devices are formatted to handle
large blocks (32 MB or more). Caching and buffering are
a,
mo
designed for high I/O throughput. The NoSQL database
dr
ng edi
an
oD s,
s
R
is implemented as a data grid for processing (mapReduce,
as
B, Be
ak C
Pick Any Two
Te rke
Ri I,
queries, CRUD, etc.)
B, , KA
rra ley
sto D
hD et
re B,
uc bin
Areas of Application
,D M
Co Ca
ata em
B, yo
• Document storage
sto cac
leD Tok
re he
• Object databases
mp t,
, H DB
Si mor
yp , S
lde
er
• Graph databases
tab cala
Vo
le, ris
• Key/value stores
o,
m
Hb
na
P
as
• Eventually consistent key/value stores
Dy
, e
This list is the implementation counterpart to the data grid Partition tolerance
areas of application; the data store feeds the computational Figure 3 - CAP Selection Chart
network and, together, they form the NoSQL database. (source: Nathan Hurst's Blog)
DZone, Inc. | www.dzone.com
3. 3
Getting Started with NoSQL and Data Scalability
encoded JSON representation. BSON is designed to be
mongoDB traversable, lightweight and efficient. Applications can map
BSON/JSON documents using native representations like
mongoDB is a document-based NoSQL database that bridges dictionaries, lists, and arrays, leaving the BSON translation
the gap between scalable key-value stores like Datastore to the native mongoDB driver specific to each programming
and Memcache DB, and RDBMS’s querying and robustness language.
capabilities. Some of its main features include:
BSON Example
• Document-oriented storage - data is manipulated as {
JSON-like documents 'name' : 'Tom',
'age' : 42
• Querying - uses JavaScript and has APIs for submitting }
queries in every major programming language
• In-place updates - atomicity Language Representation
• Indexing - any attribute in a document may be used for Python {
'name' : 'Tom',
indexing and query optimization 'age' : 42
• Auto-sharding - enables horizontal scalability }
• Map/reduce - the mongoDB cluster may run smaller Ruby {
"name" => "Tom",
MapReduce jobs than a Hadoop cluster with significant "age" => 42
}
cost and efficiency improvements
Java BasicDBObject d;
mongoDB implements its own file system to optimize I/O d = new BasicObject();
d.put("name", "Tom");
throughput by dividing larger objects into smaller chunks. d.put("age", 42);
Documents are stored in two separate collections: files PHP array( "name" => "Tom",
containing the object meta-data, and chunks that form a "age" => 42);
larger document when combined with database accounting Dynamic languages offer a closer object mapping to BSON/
information. The mongoDB API provides functions for JSON than compiled languages.
manipulating files, chunks, and indices directly. The
administration tools enable GridFS maintenance. The complete BSON specification is available from:
http://bsonspec.org/
Consumer mongoDB Programming
Programming in mongoDB requires an active server running
fail-over
the mongod and the mongos database daemons (see Figure
5), and a client application that uses one of the language-
mongoDB Server (master) mongoDB Server (slave)
mongod mongod
specific drivers.
Database Database
daemon daemon
mongos mongos
Sharding
daemon
Sharding
daemon Hot All the examples in this Refcard are written in Python
Data Data
Tip for conciseness.
Storage Storage
Starting the Server
Figure 5 - mongoDB Cluster
Log on to the master server and execute:
A mongoDB cluster consists of a master and a slave. The slave [servername:user] ./mongod
may become the master in a fail-over scenario, if necessary.
Having a master/slave configuration (also known as Active/ The server will display its status messages to the console
Passive or A/P cluster) helps ensure data integrity since only unless stdout is redirected elsewhere.
the master is allowed to commit changes to the store at any Programming Example
given time. A commit is successful only if the data is written to This example allocates a database if one doesn’t already exist,
GridFS and replicated in the slave. instantiates a collection on the server, and runs a couple of
queries.
mongoDB also supports a limited master/master
configuration. It’s useful only for inserts, queries, The mongoDB Developer Manual is available from:
Hot and deletions by specific object ID. It must not http://www.mongodb.org/display/DOCS/Manual
Tip
be used if updates of a single object may occur #!/usr/bin/env jython
concurrently. import pymongo
from pymongo import Connection
Caching
connection = Connection('servername', 27017)
mongoDB has a built-in cache that runs directly in the cluster
without external requirements. Any query is transparently db = connection['people_database']
cached in RAM to expedite data transfer rates and to reduce peopleList = db['people_list']
disk I/O. person = {
'name' : 'Tom',
Document Format 'age' : 42 }
mongoDB handles documents in BSON format, a binary-
DZone, Inc. | www.dzone.com
4. 4
Getting Started with NoSQL and Data Scalability
peopleList.insert(person) • JSON data and program objects storage - many RESTful
web services provide JSON data; they can be stored in
person = {
'name' : 'Nancy', mongoDB without language serialization overhead
'age' : 69 }
(especially when compared against XML documents)
peopleList.insert(person) • Content management systems - JSON/BSON objects can
# find first entry: represent any kind of document, including those with a
person = peopleList.find_one()
binary representation
# find a specific person:
mongoDB Drawbacks
person = peopleList.find_one({ 'name' : 'Joe'})
• No JOIN operations - each document is stand-alone
if person is None: • Complex queries - some complex queries and indices are
print "Joe isn’t here!"
else: better suited for SQL
print person['age']
• No row-level locking - unsuitable for transactional data
# bulk inserts without error prone application-level assistance
persons = [{ 'name' : 'Joe' }, {'name' : 'Sue'}]
peopleList.insert(persons) If any of these is part of the functional requirements, a SQL
database would be better suited for the application.
# queries with multiple results
for person in peopleList.find():
print person['name'] GIGASPACES XAP
for person in peopleList.find({'age' : {'$ge' : 21}}).sort('name'):
print person['name'] The GigaSpaces eXtreme Application Platform is a data
# count: grid designed to replace traditional application servers. It
nDrinkingAge = peopleList.find({'age' : {'$ge' : 21}}).count()
operates based on an event-processing model where the
application dispatches objects to the processing nodes
# indexing
associated with a given data partition. The system may be
from pymongo import ASCENDING, DESCENDING
configured so that data state on the grid may trigger events, or
peopleList.create_index([('age', DESCENDING), ('name', ASCENDING)]) an application may dispatch specific commands imperatively.
GigaSpaces XAP also manages all threading and execution
The PyMongo documentation is available at: aspects of the operation, including thread and connection
http://api.mongodb.org/python - guides for other languages pools. GigaSpaces XAP implements Spring transactions and
are also available from this web site. auto-recovery. The system detects any failed operations
The code in the previous example performs these operations: in a computational node and automatically rolls back the
transaction; it then places it back in the space where another
• Connect to the database server started in the previous
node picks it up to complete processing.
section
• Attach a database; notice that the database is treated like
Application Frameworks
an associative array
• Get a collection (loosely equivalent to a table in a Java C++ .Net Groovy
relational database), treated like an associative array
Mule Spring JEE Jetty
• Insert one or more entities
• Query for one or more entites
Although mongoDB treats all these data as BSON internally,
most of the APIs allow the use of dictionary-style objects to
streamline the development process. XAP Deployment Virtualization
XAP
Object ID Management
A successful insertion into the database results in a valid and XAP Middleware Virtualization
Monitoring
Object ID. This is the unique identifier in the database for a (Virtualized Clustering Layer)
given document. When querying the database, a return value
will include this attribute:
{
"name" : "Tom",
"age" : 42,
"_id" : ObjectId('999999')
} RDBMS Memcache DB mongoDB
Users may override this Object ID with any argument as long as
it’s unique, or allow mongoDB to assign one automatically.
Figure 4 - GigaSpaces XAP Data Grid
Common Use Cases
• Caching - more robust capabilities, plus persistence, when GigaSpaces provides data persistence, distributed processing,
compared against a pure caching system and caching by interfacing with SQL and NoSQL data stores.
• High volume processing - RDBMS may be too expensive The GigaSpaces API abstracts all back-end operations (job
or slow to run in comparison dispatching, data persistence, and caching) and makes
DZone, Inc. | www.dzone.com
5. 5
Getting Started with NoSQL and Data Scalability
it transparent to the application. GigaSpaces XAP may
implement distributed computing operations like MapReduce GOOGLE APP ENGINE DATASTORE
and run them in its nodes, or it may dispatch them for
processing to the underlying subsystem if the functionality The Datastore is the main scalability feature of Google App
is available (e.g. mongoDB, Hadoop). The application may Engine applications. It’s not a relational database or a façade
implement transactions using the Spring API, the GigaSpaces for one. Datastore is a public API for accessing Google’s
XPA transactional facilities, or by implementing a workflow Bigtable high-performance distributed database system.
where specific NoSQL stores handle entities or groups of Think of it as a sparse array distributed across multiple servers
entities. This must be implemented explicitly in systems like that also allows an infinite number of columns and rows.
mongoDB, but may exist in other NoSQL systems like Google Applications may even define new columns “on the fly”. The
App Engine’s Datastore. Datastore scales by adding new servers to a cluster; Google
provides this functionality without user participation.
GigaSpaces XAP Programming Google Your
Applications Applications
The GigaSpaces XAP API is very rich and covers many aspects
beyond NoSQL and data scalability areas like configuration Datastore API 1 Datastore
Java (JDO, JPA) Other language Python
management, deployment, web services, third-party product
Bigtable
integration, etc. Master Server
(Logical table management, load balancing, garbage collection)
The GigaSpaces XAP documentation is at:
Tablet Tablet Tablet
http://www.gigaspaces.com/wiki/display/XAP71/Programmer%27s+Guide Server 0 Server 1 Server n
NoSQL operations may be implemented over these APIs: Google File System
• SQLQuery - allows querying a space using a SQL-like FS 0 FS 1 FS 2 FS n
syntax and regular expressions; do not confuse it with
Figure 6 - Datastore Architecture
JDBC support.
Datastore operations are defined around entities. Entities
• Persistency - mostly supports RDBMSs but may implement
can have one-to-many or many-to-many relationships. The
other persistency mechanisms through the External Data
Datastore assigns unique IDs unless the application specifies
Source Components API.
a unique key. Datastore also disallows some property names
• memcached - support for key/value pair distributed that it uses for housekeeping. The complete Datastore
dictionaries available to any client in the grid; entities documentation is available from:
are automatically made available across all nodes. The http://code.google.com/appengine/docs/python/datastore/
memcached API is implemented on top of the data
grid, and it’s interchangeable with other memcached Did you notice the parallels between Datastore
implementations. Hot and mongoDB so far? Many NoSQL database
• Task Execution - allows synchronous and asynchronous job Tip implementations have solved similar problems in
execution on specific nodes or clusters similar ways.
The GigaSpaces XAP API is in a minority of stateful NoSQL Transactions and Entity Groups
systems. Most NoSQL systems strive to achieve statelessness Datastore supports transactions. These transactions are only
to increase scalability and data consistency. effective on entities that belong to the same entity group.
Entities in a given group are guaranteed to be stored on the
Common Use Cases same server.
• Real-time analytics - dynamic data analysis and reporting
Datastore Programming
based on data entered into a system less than a minute
The programming model is based on inheritance of basic
before effective time of use
entities, db.Model and db.Expando. Persistent data is
• Map/reduce - distributed data processing of large data mapped onto an entity specialization of either of these classes.
sets across a computational grid The API provides persistence and querying instance methods
• Near-zero downtime - allows for database schema for every entity managed by the Datastore.
changes without homebrew master/slave configurations or
Programming Example
proprietary RDBMS dependencies
The Datastore API is simpler than other NoSQL APIs and is
GigaSpaces XAP NoSQL Drawbacks highly optimized to work in the App Engine environment. In
• Complexity - the server, transactional, and grid model are this example we insert data into the data store, then run a query:
more complex than for other NoSQL systems from google.appengine.ext import db
• Application server model - the API and components are class Person(db.Model):
name = db.StringProperty(required=True)
geared toward building applications and transactional age = db.IntegerProperty(require=True)
logic
person = Person(name = 'Tom', age = 42)
• Steeper learning curve
person.put()
• Higher TCO - brings a requirement of a specialized,
person = Person(name = 'Sue', age = 69)
well-trained system administration team with higher
requirements than other NoSQL systems person.put()
DZone, Inc. | www.dzone.com