NoSQL Solutions - a comparative study
Presentation Transcript

  • NoSQL solutions: a comparative study
  • Brief history of databases
    ● 1980: Oracle DBMS released
    ● 1995: first release of MySQL, a lightweight, asynchronously replicated database
    ● 2000: MySQL 3.23 is adopted by most startups as a database platform
    ● 2004: Google develops BigTable
    ● 2009: the term NoSQL is coined
  • Evolution of database storage
    ● Monolithic: using traditional databases (Oracle, Sybase...)
      ○ High cost in infrastructure (big CPU, big SAN)
      ○ High cost in software licenses (commercial software)
      ○ Qualified personnel required (certified DBAs)
    ● LAMP platform (circa 2000)
      ○ Free software
      ○ Runs on commodity hardware
      ○ Low administration (no need for a DBA until the data grows large)
    ● NoSQL
      ○ Scales indefinitely (replication only scales vertically)
      ○ Logic is in the application
      ○ Doesn't require SQL or internals knowledge
  • Scaling
    ● Vertical scaling (MySQL)
      ○ Scaling by adding more replicas
      ○ Load is evenly distributed across the replicas
      ○ Problems!
        ■ Cached data is also distributed evenly: inefficient resource usage
        ■ Efficiency drops once the dataset exceeds available memory; adding replicas no longer adds performance
        ■ Write bottleneck
    ● Horizontal scaling (NoSQL)
      ○ Data is distributed evenly across the nodes (hashing)
      ○ More capacity? Just add one node
      ○ Loss of traditional database properties (ACID)
  • NoSQL definition
    ● Not only SQL (not necessarily exposed via a SQL query language)
    ● Non-relational (denormalized data)
    ● Distributed (horizontal partitioning)
    ● Different implementations:
      ○ Key-value store
      ○ Document database
      ○ Graph database
  • Key-value stores
    ● Schema-less storage
    ● Basic associative arrays
      { "username" => "guillaume" }
    ● Key-value stores can have column families and subkeys
      { "user:name" => "guillaume", "user:uid" => 1000 }
    ● Implementations
      ○ K/V caches: Redis, memcached
        ■ in-memory databases
      ○ Column databases: Cassandra, HBase
        ■ Data is stored by column family (as opposed to by row in a traditional RDBMS)
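    A small illustration of the key/value pattern above: a sketch using the Jedis client for Redis. The client library, key names and the local Redis server are assumptions, not part of the original deck.

      import redis.clients.jedis.Jedis;

      public class KeyValueExample {
          public static void main(String[] args) {
              // Assumes a Redis server on localhost:6379 and the Jedis client on the classpath
              Jedis redis = new Jedis("localhost", 6379);

              // Plain key/value pair, as in { "username" => "guillaume" }
              redis.set("username", "guillaume");
              System.out.println(redis.get("username"));

              // Subkeys modelled with a Redis hash, as in { "user:name" => ..., "user:uid" => ... }
              redis.hset("user:1000", "name", "guillaume");
              redis.hset("user:1000", "uid", "1000");
              System.out.println(redis.hgetAll("user:1000"));

              redis.close();
          }
      }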
  • Document Databases
    ● Data is organized into documents:
      FirstName="Frank", City="Haifa", Hobby="Photographing"
    ● No strong typing or predefined fields; additional information can be added easily:
      FirstName="Guillaume", Address="Hidalgo Village, Pasay City", Languages=[{Name:"French"}, {Name:"English"}, {Name:"Tagalog"}]
    ● An ensemble of documents is called a collection
    ● Uses structured standards: XML, JSON
    ● Implementations
      ○ CouchDB (Erlang)
      ○ MongoDB (C++)
      ○ RavenDB (.NET)
  • Graph Databases
    ● Uses a graph structure to represent information
    ● Typically used for relations
      ○ Example: followers/following on Twitter (a toy sketch follows below)
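    To illustrate the idea, a toy in-memory sketch of the follower/following relation as a directed graph in plain Java; a real graph database persists nodes and edges and adds traversal queries. User names are illustrative.

      import java.util.Collections;
      import java.util.HashMap;
      import java.util.HashSet;
      import java.util.Map;
      import java.util.Set;

      public class FollowerGraph {
          // Adjacency list: each user points to the set of users they follow
          private final Map<String, Set<String>> following = new HashMap<>();

          public void follow(String follower, String followee) {
              following.computeIfAbsent(follower, k -> new HashSet<>()).add(followee);
          }

          public Set<String> followingOf(String user) {
              return following.getOrDefault(user, Collections.emptySet());
          }

          public static void main(String[] args) {
              FollowerGraph graph = new FollowerGraph();
              graph.follow("alice", "bob");
              graph.follow("alice", "carol");
              graph.follow("bob", "carol");
              System.out.println("alice follows " + graph.followingOf("alice"));
          }
      }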
  • Databases at Toluna
    ● MySQL
      ○ Traditional master-slave configuration
      ○ Very efficient for small requests
      ○ Not good for analytics
      ○ "Big Data" issues (e.g. the usersvotes table)
    ● Microsoft SQL Server
      ○ Good all-around performance
      ○ Monolithic
        ■ Suffers from locking issues
        ■ Hard to scale (many connections)
      ○ Potentially complex SQL programming to get the best out of it
  • Solutions? Let's evaluate some products...
  • Apache HBase
    ● A column database based on the Hadoop architecture
    ● Commercially supported (Cloudera)
    ● Available on Red Hat, Debian
    ● Designed for very big data storage (terabytes)
    ● Users: Facebook, Yahoo!, Adobe, Mahalo, Twitter
    Pros
    ● Pure Java implementation
    ● Access to Hadoop MapReduce data via column storage
    ● True clustered architecture
    Cons
    ● Java
    ● Hard to deploy and maintain
    ● Limited options via the API (get, put, scans)
  • Apache HBase: architecture
    ● Data is stored in cells
      ○ Primary row key
      ○ Column family
        ■ Limited in number
        ■ May have an indefinite number of qualifiers
      ○ Timestamp (version)
    ● Example cell structure:
      RowId  Column Family:Qualifier  Timestamp   Value
      1000   user:name                1312868789  guillaume
      1000   user:email               1312868789  g@dragonscale.eu
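    A minimal sketch of writing and reading such a cell through the HBase Java client of that era (HTable, Put, Get). The table name "users" and the cluster configuration are assumptions; the row key, column family and qualifier mirror the example above.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.Get;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Put;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.util.Bytes;

      public class HBaseCellExample {
          public static void main(String[] args) throws Exception {
              // Cluster location is read from hbase-site.xml on the classpath (assumption)
              Configuration conf = HBaseConfiguration.create();
              HTable table = new HTable(conf, "users");

              // Write one cell: row key 1000, column family "user", qualifier "name"
              Put put = new Put(Bytes.toBytes("1000"));
              put.add(Bytes.toBytes("user"), Bytes.toBytes("name"), Bytes.toBytes("guillaume"));
              table.put(put);

              // Read it back; the latest timestamped version is returned by default
              Get get = new Get(Bytes.toBytes("1000"));
              Result result = table.get(get);
              byte[] value = result.getValue(Bytes.toBytes("user"), Bytes.toBytes("name"));
              System.out.println(Bytes.toString(value));

              table.close();
          }
      }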
  • HBase data operations
    ● Create a table and put some data:
      create 'usersvotes', 'userid', 'date', 'meta'
      hadoop jar /usr/lib/hbase/hbase-0.90.3-cdh3u1.jar importtsv -Dimporttsv.columns=userid,HBASE_ROW_KEY,date,meta:answer,meta:country usersvotes ~/import/
    ● Retrieve data:
      hbase(main):002:0> get 'usersvotes', '1071726'
      COLUMN         CELL
       date:         timestamp=1312780185940, value=1296523245
       meta:answer   timestamp=1312780185940, value=2
       meta:country  timestamp=1312780185940, value=ES
       userid:       timestamp=1312780185940, value=685352
      4 row(s) in 0.0720 seconds
    ● The last versioned row (highest timestamp) for the specified primary key is retrieved
  • HBase data operations: API
    ● If not using Java, HBase must be queried through a web service (XML, JSON or protobuf)
    ● Types of operations:
      ○ Get (read a single value)
      ○ Put (write a single value)
      ○ Get multi (read multiple versions)
      ○ Scan (retrieve multiple rows via a scan; a sketch follows below)
    ● MapReduce jobs can be run against the database using Java code
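    A minimal sketch of the scan operation with the HBase Java client; the usersvotes table and meta:country column come from the deck, while the connection configuration is assumed.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.hbase.HBaseConfiguration;
      import org.apache.hadoop.hbase.client.HTable;
      import org.apache.hadoop.hbase.client.Result;
      import org.apache.hadoop.hbase.client.ResultScanner;
      import org.apache.hadoop.hbase.client.Scan;
      import org.apache.hadoop.hbase.util.Bytes;

      public class UsersVotesScan {
          public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HTable table = new HTable(conf, "usersvotes");

              // Scan only the meta:country column of every row
              Scan scan = new Scan();
              scan.addColumn(Bytes.toBytes("meta"), Bytes.toBytes("country"));

              ResultScanner scanner = table.getScanner(scan);
              try {
                  for (Result row : scanner) {
                      String country = Bytes.toString(
                              row.getValue(Bytes.toBytes("meta"), Bytes.toBytes("country")));
                      System.out.println(Bytes.toString(row.getRow()) + " -> " + country);
                  }
              } finally {
                  scanner.close();
                  table.close();
              }
          }
      }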
  • What is MapReduce?
    "Map" step: the master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.
    "Reduce" step: the master node then takes the answers to all the sub-problems and combines them in some way to get the output – the answer to the problem it was originally trying to solve.
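    To make the two steps concrete, here is the canonical word-count job written against the Hadoop MapReduce Java API. This example is not from the deck; class names and the input/output paths are illustrative.

      import java.io.IOException;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class WordCount {

          // "Map" step: each worker emits (word, 1) for every word in its input split
          public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
              private static final IntWritable ONE = new IntWritable(1);
              private final Text word = new Text();

              @Override
              protected void map(LongWritable key, Text value, Context context)
                      throws IOException, InterruptedException {
                  for (String token : value.toString().split("\\s+")) {
                      if (!token.isEmpty()) {
                          word.set(token);
                          context.write(word, ONE);
                      }
                  }
              }
          }

          // "Reduce" step: all counts emitted for the same word are combined into a total
          public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
              @Override
              protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                      throws IOException, InterruptedException {
                  int sum = 0;
                  for (IntWritable v : values) {
                      sum += v.get();
                  }
                  context.write(key, new IntWritable(sum));
              }
          }

          public static void main(String[] args) throws Exception {
              Job job = Job.getInstance(new Configuration(), "word count");
              job.setJarByClass(WordCount.class);
              job.setMapperClass(TokenMapper.class);
              job.setReducerClass(SumReducer.class);
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(IntWritable.class);
              FileInputFormat.addInputPath(job, new Path(args[0]));
              FileOutputFormat.setOutputPath(job, new Path(args[1]));
              System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
      }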
  • MongoDB
    ● A document-oriented database
    ● Written in C++
    ● Uses JavaScript as its query and scripting language
    ● Commercial support (10gen)
    ● Large user base (NY Times, Disney, MTV, foursquare)
    Pros
    ● Easy installation and deployment
    ● Sharding and replication
    ● Easy API (JavaScript), multiple languages
    ● Similarities to MySQL (indexes, queries)
    Cons
    ● Versions < 1.8.0 had many issues (development not mature): consistency, crashes...
  • MongoDB data structure
    ● No predefinition of fields; all operations are implicit
      // Create a document
      > d = { "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3, "country" : "GB" };
      // Save it into a new collection
      > db.usersvotes.save(d);
      // Retrieve documents from the collection
      > db.usersvotes.find();
      { "_id" : ObjectId("4e3bde5ae84838f87bf883b2"), "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3, "country" : "GB" }
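    The same operations can be done from application code. A sketch using the early MongoDB Java driver (Mongo, DB, DBCollection classes); the database name and connection details are assumptions, and the document mirrors the shell session above.

      import com.mongodb.BasicDBObject;
      import com.mongodb.DB;
      import com.mongodb.DBCollection;
      import com.mongodb.DBCursor;
      import com.mongodb.Mongo;

      public class UsersVotesExample {
          public static void main(String[] args) throws Exception {
              // Assumes a mongod instance running on localhost:27017
              Mongo mongo = new Mongo("localhost", 27017);
              DB db = mongo.getDB("toluna");
              DBCollection usersvotes = db.getCollection("usersvotes");

              // Create and save a document; no schema has to be declared beforehand
              BasicDBObject doc = new BasicDBObject();
              doc.put("userid", 8173095L);
              doc.put("pollid", 53064L);
              doc.put("date", 1293874493L);
              doc.put("answer", 3);
              doc.put("country", "GB");
              usersvotes.save(doc);

              // Retrieve every document in the collection
              DBCursor cursor = usersvotes.find();
              while (cursor.hasNext()) {
                  System.out.println(cursor.next());
              }

              mongo.close();
          }
      }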
  • MongoDB indexes and queries
    Let's create an index...
      db.usersvotes.ensureIndex({pollid: 1, date: -1})
    ● Indexes can be created in the background
    ● Index keys are sortable
    ● Queries without indexes are slow (scans)
    Let's get a usersvotes stream:
      db.usersvotes.find({pollid: 676781}).sort({date: -1}).skip(10).limit(10);
      // Equivalent SQL: SELECT * FROM usersvotes WHERE pollid = 676781 ORDER BY date DESC LIMIT 10,10;
      { "_id" : ObjectId("4e3bde5ce84838f87bfa31e6"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong(1295077466), "answer" : 1, "country" : "GB" }
      { "_id" : ObjectId("4e3bde5ce84838f87bfa31e7"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong(1295077466), "answer" : 5, "country" : "GB" }
  • MongoDB MapReduce
    ● Uses JavaScript functions
    ● Single-threaded (JavaScript limitation)
    ● Parallelized (runs across all shards)
      // Aggregate data over the country field
      m = function() { emit(this.country, { count: 1 }); };
      // Count items in each country
      r = function(k, vals) {
          var result = { count: 0 };
          vals.forEach(function(value) { result.count += value.count; });
          return result;
      };
      // Start the job
      res = db.usersvotes.mapReduce(m, r, { out: { inline: 1 } });
  • VoltDB
    A SQL alternative
    ● "NewSQL"
    ● In-memory database
    Pros
    ● Lock-free
    ● ACID properties
    ● Linear scalability
    ● Fault tolerant
    ● Java implementation
    Cons
    ● Cannot query the database directly; Java stored procedures only (a sketch follows below)
    ● Database shutdown needed to modify the schema
    ● Database shutdown needed to add cluster nodes
    ● No more memory = no more storage
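    A minimal sketch of what a VoltDB Java stored procedure looks like (VoltProcedure, SQLStmt, voltQueueSQL/voltExecuteSQL); the procedure name, table and columns are illustrative, not taken from the deck.

      import org.voltdb.SQLStmt;
      import org.voltdb.VoltProcedure;
      import org.voltdb.VoltTable;

      // Hypothetical procedure: fetch the latest votes for one poll
      public class GetPollVotes extends VoltProcedure {

          public final SQLStmt selectVotes = new SQLStmt(
              "SELECT userid, votedate, answer, country " +
              "FROM usersvotes WHERE pollid = ? ORDER BY votedate DESC LIMIT 10;");

          public VoltTable[] run(long pollId) {
              // Queue the parameterized statement, then execute the batch
              voltQueueSQL(selectVotes, pollId);
              return voltExecuteSQL(true);
          }
      }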
  • What about MySQL? Potential solutions
  • MySQL: some interesting facts
    ● In single-node tests, MySQL was always faster than the NoSQL solutions
    ● Loading data was faster
      ○ Sample usersvotes data (1 GB TSV file)
        ■ MySQL: 20 seconds
        ■ MongoDB: >10 minutes
        ■ HBase: >30 minutes
    ● Proven technology
  • MySQL analytics
    ● MySQL might be outperformed by other solutions in analytics, depending on the data size
    ● Several column-oriented storage solutions exist for MySQL (Infobright, ICE, Tokutek)
    ● Word-count operations can be offloaded to a full-text search engine (Sphinx, Solr, Lucene)
  • MySQL Big Data
    ● The vote stream case
    ● Simple query:
      EXPLAIN SELECT * FROM toluna_polls.usersvotes WHERE pollid=843206 ORDER BY votedate DESC LIMIT 10,20
      1, SIMPLE, usersvotes, ref, Index_POLLID, Index_POLLID, 8, const, 2556, Using where; Using filesort
    ● Can be easily solved by a composite index:
      ALTER TABLE usersvotes ADD KEY (pollid, votedate)
    ● But!
      ○ usersvotes = 160 GB of datafiles
      ○ Adding the index is an offline operation and would take hours
      ○ An online schema change could be used, but might run out of space and/or take days
  • Conclusions
    ● HBase: a good choice for analytics, but not well adapted to traditional database operations
      ○ Most companies use HBase/Hadoop to offload analytical data from their main database
      ○ Java experience needed (which Toluna has, imho)
      ○ IT must be trained
    ● MongoDB
      ○ A very good choice for starting new web applications "from the ground up"
    ● VoltDB
      ○ Great technology, but lacks flexibility
    ● Traditional databases
      ○ Will probably be around for a long time
  • Questions?