No sql and data scalability


Published on

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

No sql and data scalability

  1. 1. #105 Get More Refcardz! Visit Getting Started with CONTENTS INCLUDE: n Introduction n Scalable Data Architecture NoSQL and Data Scalability n Is NoSQL For You? n mongoDB n GigaSpaces XAP Google App Engine Datastore and more... By Eugene Ciurana n Implementations of either often share characteristics from the INTRODUCTION other. The DZone Refcard #43 is an introduction to system high Data Grids availability and scalability terminology and techniques Data grids process workloads defined as independent jobs ( The next that don’t require data sharing among processes. Storage logical step is the scalable handling of massive data volumes or network may be shared across all nodes of the grid, but resulting from having these powerful processing capabilities. intermediate results have no bearing on other jobs progress or on other nodes in the grid, such as a MapReduce cluster. This Refcard demystifies NoSQL and data scalability techniques by introducing some core concepts. It also offers an overview of current technologies available in this domain Consumer and suggests how to apply them. What is Data Scalability? Master Data scalability is the ability of a system to store, manipulate, analyze, and otherwise process ever increasing amounts of data without reducing overall system availability, performance, Load Balancer or throughput. Data scalability is achieved by a combination of more powerful Node Node Node Node processing capabilities and larger but efficient storage mechanisms. Relational and hierarchical databases scale up by adding more Load Balancer processors, more storage, caching systems, and such. Soon they hit either a cost or a practical scalability limit because they are difficult or impossible to scale out. These database Node Node Node Node management systems are designed as single units that must maintain data integrity, and enforce schema rules to guarantee Figure 1 - Data Grid it. This rigidity is what promotes upward, but not outward, Getting Started with NoSQL and Data Scalability scalability. Data grids expose their functionality through a single API (either a Web service or native to the application programming Oracle RAC is a cluster of multiple computers with language) that abstracts its topology and implementation from access to a common database. This is considered its data processing consumer. Hot only vertically scalable because the processing Tip (usually in the form of stored procedures) may scale out, but the shared storage facilities don’t scale with the cluster. Get over 90 DZone Refcardz Data integrity and schemas are suited for handling transactional, normalized, uniform data. They handle FREE from! unstructured or rapidly evolving data structures with difficulty or exponentially larger costs. Hot Data replication is not the same as data scalability! Tip SCALABLE DATA ARCHITECTURES There are two general kinds of architectures used for building scalable data systems: data grids and NoSQL systems. DZone, Inc. |
  2. 2. 2 Getting Started with NoSQL and Data ScalabilityAreas of Application NoSQL vs RDBMS vs OO Analogies • Financial modeling NoSQL RDBMS OO • Data mining Kind Table Class • Click stream analytics Entity Record Object • Document clustering • Distributed sorting or grepping Attribute Column Property • Simulations • Inverted index construction IS NoSQL FOR YOU? • Protein foldingNoSQL Preparation:NoSQL describes a horizontally scalable, non-relational • Don’t fall prey to the NoSQL faddatabase with built-in replication support. Applications • Don’t be stubborn; neither NoSQL nor traditionalinteract with it through a simple API, and the data is stored in a databases apply to all cases“schema-free”, flat addressing repository, usually as large files • Apply the CAP Theorem to your use cases toor data blocks. The repository is often a custom file system determine feasibilitydesigned to support NoSQL operations. Brewer’s (CAP) Theorem NoSQL is intended as shorthand for “not only It’s impossible for a distributed computer system to Hot SQL.” Complete architectures almost always mix simultaneously provide all three of these guarantees: Tip traditional and NoSQL databases. • Consistency (all nodes see the same data at the same time) • Availability (node failures don’t prevent survivors fromNoSQL setups are best suited for non-OLTP applications that continuing to operate)process massive amounts of structured and unstructured data • Partition tolerance (no failures less than total networkat a lower cost and with higher efficiency than RDBMs and failures cause the system to fail)stored procedures. Since only two of these characteristics are guaranteed for any given scalable system, use your functional specification and Consumer business SLA (service level agreement) to determine what your minimum and target goals for CAP are, pick the two that meet your requirements, and proceed to implement the appropriate Node Node Node Node technology. Rule of Thumb: NoSQL’s primary goal is to achieve Virtual File System logical table management, load balancing, garbage collection Hot horizontal scalability. It attains this by reducing (HDFS, mongoFS, Hypertable) Tip transactional semantics and referential integrity. Tablet Tablet Tablet Server 0 Server 1 Server n Use Figure 3 to identify the best match between your Distributed File System application’s CAP requirements and the suggested SQL and NoSQL systems listed. FS 0 FS 1 FS 2 FS n Figure 2 - NoSQL Topology Relational Key-Value Column-OrientedNoSQL storage is highly replicated (a commit doesn’t occur Document-Orienteduntil the data is successfully written to at least two separate Consistency Availabilitystorage devices) and the file systems are optimized for write- C A RDBMs (Oracle, MySQL), Aster Data, Green Plum, Verticaonly commits. The storage devices are formatted to handlelarge blocks (32 MB or more). Caching and buffering are a, modesigned for high I/O throughput. The NoSQL database dr ng edi an oD s, s Ris implemented as a data grid for processing (mapReduce, as B, Be ak C Pick Any Two Te rke Ri I,queries, CRUD, etc.) B, , KA rra ley sto D hD et re B, uc binAreas of Application ,D M Co Ca ata em B, yo • Document storage sto cac leD Tok re he • Object databases mp t, , H DB Si mor yp , S lde er • Graph databases tab cala Vo le, ris • Key/value stores o, m Hb na P as • Eventually consistent key/value stores Dy , eThis list is the implementation counterpart to the data grid Partition toleranceareas of application; the data store feeds the computational Figure 3 - CAP Selection Chartnetwork and, together, they form the NoSQL database. (source: Nathan Hursts Blog) DZone, Inc. |
  3. 3. 3 Getting Started with NoSQL and Data Scalability encoded JSON representation. BSON is designed to be mongoDB traversable, lightweight and efficient. Applications can map BSON/JSON documents using native representations likemongoDB is a document-based NoSQL database that bridges dictionaries, lists, and arrays, leaving the BSON translationthe gap between scalable key-value stores like Datastore to the native mongoDB driver specific to each programmingand Memcache DB, and RDBMS’s querying and robustness language.capabilities. Some of its main features include: BSON Example • Document-oriented storage - data is manipulated as { JSON-like documents name : Tom, age : 42 • Querying - uses JavaScript and has APIs for submitting } queries in every major programming language • In-place updates - atomicity Language Representation • Indexing - any attribute in a document may be used for Python { name : Tom, indexing and query optimization age : 42 • Auto-sharding - enables horizontal scalability } • Map/reduce - the mongoDB cluster may run smaller Ruby { "name" => "Tom", MapReduce jobs than a Hadoop cluster with significant "age" => 42 } cost and efficiency improvements Java BasicDBObject d;mongoDB implements its own file system to optimize I/O d = new BasicObject(); d.put("name", "Tom");throughput by dividing larger objects into smaller chunks. d.put("age", 42);Documents are stored in two separate collections: files PHP array( "name" => "Tom",containing the object meta-data, and chunks that form a "age" => 42);larger document when combined with database accounting Dynamic languages offer a closer object mapping to BSON/information. The mongoDB API provides functions for JSON than compiled languages.manipulating files, chunks, and indices directly. Theadministration tools enable GridFS maintenance. The complete BSON specification is available from: Consumer mongoDB Programming Programming in mongoDB requires an active server running fail-over the mongod and the mongos database daemons (see Figure 5), and a client application that uses one of the language- mongoDB Server (master) mongoDB Server (slave) mongod mongod specific drivers. Database Database daemon daemon mongos mongos Sharding daemon Sharding daemon Hot All the examples in this Refcard are written in Python Data Data Tip for conciseness. Storage Storage Starting the Server Figure 5 - mongoDB Cluster Log on to the master server and execute:A mongoDB cluster consists of a master and a slave. The slave [servername:user] ./mongodmay become the master in a fail-over scenario, if necessary.Having a master/slave configuration (also known as Active/ The server will display its status messages to the consolePassive or A/P cluster) helps ensure data integrity since only unless stdout is redirected elsewhere.the master is allowed to commit changes to the store at any Programming Examplegiven time. A commit is successful only if the data is written to This example allocates a database if one doesn’t already exist,GridFS and replicated in the slave. instantiates a collection on the server, and runs a couple of queries. mongoDB also supports a limited master/master configuration. It’s useful only for inserts, queries, The mongoDB Developer Manual is available from: Hot and deletions by specific object ID. It must not Tip be used if updates of a single object may occur #!/usr/bin/env jython concurrently. import pymongo from pymongo import ConnectionCaching connection = Connection(servername, 27017)mongoDB has a built-in cache that runs directly in the clusterwithout external requirements. Any query is transparently db = connection[people_database]cached in RAM to expedite data transfer rates and to reduce peopleList = db[people_list]disk I/O. person = { name : Tom,Document Format age : 42 }mongoDB handles documents in BSON format, a binary- DZone, Inc. |
  4. 4. 4 Getting Started with NoSQL and Data Scalability peopleList.insert(person) • JSON data and program objects storage - many RESTful web services provide JSON data; they can be stored in person = { name : Nancy, mongoDB without language serialization overhead age : 69 } (especially when compared against XML documents) peopleList.insert(person) • Content management systems - JSON/BSON objects can # find first entry: represent any kind of document, including those with a person = peopleList.find_one() binary representation # find a specific person: mongoDB Drawbacks person = peopleList.find_one({ name : Joe}) • No JOIN operations - each document is stand-alone if person is None: • Complex queries - some complex queries and indices are print "Joe isn’t here!" else: better suited for SQL print person[age] • No row-level locking - unsuitable for transactional data # bulk inserts without error prone application-level assistance persons = [{ name : Joe }, {name : Sue}] peopleList.insert(persons) If any of these is part of the functional requirements, a SQL database would be better suited for the application. # queries with multiple results for person in peopleList.find(): print person[name] GIGASPACES XAP for person in peopleList.find({age : {$ge : 21}}).sort(name): print person[name] The GigaSpaces eXtreme Application Platform is a data # count: grid designed to replace traditional application servers. It nDrinkingAge = peopleList.find({age : {$ge : 21}}).count() operates based on an event-processing model where the application dispatches objects to the processing nodes # indexing associated with a given data partition. The system may be from pymongo import ASCENDING, DESCENDING configured so that data state on the grid may trigger events, or peopleList.create_index([(age, DESCENDING), (name, ASCENDING)]) an application may dispatch specific commands imperatively. GigaSpaces XAP also manages all threading and executionThe PyMongo documentation is available at: aspects of the operation, including thread and connection - guides for other languages pools. GigaSpaces XAP implements Spring transactions andare also available from this web site. auto-recovery. The system detects any failed operationsThe code in the previous example performs these operations: in a computational node and automatically rolls back the transaction; it then places it back in the space where another • Connect to the database server started in the previous node picks it up to complete processing. section • Attach a database; notice that the database is treated like Application Frameworks an associative array • Get a collection (loosely equivalent to a table in a Java C++ .Net Groovy relational database), treated like an associative array Mule Spring JEE Jetty • Insert one or more entities • Query for one or more entitesAlthough mongoDB treats all these data as BSON internally,most of the APIs allow the use of dictionary-style objects tostreamline the development process. XAP Deployment Virtualization XAPObject ID ManagementA successful insertion into the database results in a valid and XAP Middleware Virtualization MonitoringObject ID. This is the unique identifier in the database for a (Virtualized Clustering Layer)given document. When querying the database, a return valuewill include this attribute: { "name" : "Tom", "age" : 42, "_id" : ObjectId(999999) } RDBMS Memcache DB mongoDBUsers may override this Object ID with any argument as long asit’s unique, or allow mongoDB to assign one automatically. Figure 4 - GigaSpaces XAP Data GridCommon Use Cases • Caching - more robust capabilities, plus persistence, when GigaSpaces provides data persistence, distributed processing, compared against a pure caching system and caching by interfacing with SQL and NoSQL data stores. • High volume processing - RDBMS may be too expensive The GigaSpaces API abstracts all back-end operations (job or slow to run in comparison dispatching, data persistence, and caching) and makes DZone, Inc. |
  5. 5. 5 Getting Started with NoSQL and Data Scalabilityit transparent to the application. GigaSpaces XAP mayimplement distributed computing operations like MapReduce GOOGLE APP ENGINE DATASTOREand run them in its nodes, or it may dispatch them forprocessing to the underlying subsystem if the functionality The Datastore is the main scalability feature of Google Appis available (e.g. mongoDB, Hadoop). The application may Engine applications. It’s not a relational database or a façadeimplement transactions using the Spring API, the GigaSpaces for one. Datastore is a public API for accessing Google’sXPA transactional facilities, or by implementing a workflow Bigtable high-performance distributed database system.where specific NoSQL stores handle entities or groups of Think of it as a sparse array distributed across multiple serversentities. This must be implemented explicitly in systems like that also allows an infinite number of columns and rows.mongoDB, but may exist in other NoSQL systems like Google Applications may even define new columns “on the fly”. TheApp Engine’s Datastore. Datastore scales by adding new servers to a cluster; Google provides this functionality without user participation.GigaSpaces XAP Programming Google Your Applications ApplicationsThe GigaSpaces XAP API is very rich and covers many aspectsbeyond NoSQL and data scalability areas like configuration Datastore API 1 Datastore Java (JDO, JPA) Other language Pythonmanagement, deployment, web services, third-party product Bigtableintegration, etc. Master Server (Logical table management, load balancing, garbage collection)The GigaSpaces XAP documentation is at: Tablet Tablet Tablet Server 0 Server 1 Server nNoSQL operations may be implemented over these APIs: Google File System • SQLQuery - allows querying a space using a SQL-like FS 0 FS 1 FS 2 FS n syntax and regular expressions; do not confuse it with Figure 6 - Datastore Architecture JDBC support. Datastore operations are defined around entities. Entities • Persistency - mostly supports RDBMSs but may implement can have one-to-many or many-to-many relationships. The other persistency mechanisms through the External Data Datastore assigns unique IDs unless the application specifies Source Components API. a unique key. Datastore also disallows some property names • memcached - support for key/value pair distributed that it uses for housekeeping. The complete Datastore dictionaries available to any client in the grid; entities documentation is available from: are automatically made available across all nodes. The memcached API is implemented on top of the data grid, and it’s interchangeable with other memcached Did you notice the parallels between Datastore implementations. Hot and mongoDB so far? Many NoSQL database • Task Execution - allows synchronous and asynchronous job Tip implementations have solved similar problems in execution on specific nodes or clusters similar ways.The GigaSpaces XAP API is in a minority of stateful NoSQL Transactions and Entity Groupssystems. Most NoSQL systems strive to achieve statelessness Datastore supports transactions. These transactions are onlyto increase scalability and data consistency. effective on entities that belong to the same entity group. Entities in a given group are guaranteed to be stored on theCommon Use Cases same server. • Real-time analytics - dynamic data analysis and reporting Datastore Programming based on data entered into a system less than a minute The programming model is based on inheritance of basic before effective time of use entities, db.Model and db.Expando. Persistent data is • Map/reduce - distributed data processing of large data mapped onto an entity specialization of either of these classes. sets across a computational grid The API provides persistence and querying instance methods • Near-zero downtime - allows for database schema for every entity managed by the Datastore. changes without homebrew master/slave configurations or Programming Example proprietary RDBMS dependencies The Datastore API is simpler than other NoSQL APIs and isGigaSpaces XAP NoSQL Drawbacks highly optimized to work in the App Engine environment. In • Complexity - the server, transactional, and grid model are this example we insert data into the data store, then run a query: more complex than for other NoSQL systems from google.appengine.ext import db • Application server model - the API and components are class Person(db.Model): name = db.StringProperty(required=True) geared toward building applications and transactional age = db.IntegerProperty(require=True) logic person = Person(name = Tom, age = 42) • Steeper learning curve person.put() • Higher TCO - brings a requirement of a specialized, person = Person(name = Sue, age = 69) well-trained system administration team with higher requirements than other NoSQL systems person.put() DZone, Inc. |