NoSQL Solutions - a comparative study
Published in: Technology

  • 1. NoSQL solutions: A comparative study
  • 2. Brief history of databases
      ● 1980: Oracle DBMS released
      ● 1995: first release of MySQL, a lightweight, asynchronously replicated database
      ● 2000: MySQL 3.23 is adopted by most startups as a database platform
      ● 2004: Google develops BigTable
      ● 2009: The term NoSQL is coined
  • 3. Evolution of database storage
      ● Monolithic: using traditional databases (Oracle, Sybase...)
          ○ High cost in infrastructure (big CPU, big SAN)
          ○ High cost in software licenses (commercial software)
          ○ Qualified personnel required (certified DBAs)
      ● LAMP platform (circa 2000)
          ○ Free software
          ○ Runs on commodity hardware
          ○ Low administration (you don't need a DBA until your data grows large)
      ● NoSQL
          ○ Scales indefinitely (replication only scales vertically)
          ○ Logic is in the application
          ○ Doesn't require knowledge of SQL or database internals
  • 4. Scaling
      ● Vertical scaling (MySQL)
          ○ Scaling by adding more replicas
          ○ Load is evenly distributed across the replicas
          ○ Problems!
              ■ Cached data is also distributed evenly: inefficient resource usage
              ■ Efficiency goes down as the dataset exceeds available memory; adding replicas cannot buy more performance
              ■ Write bottleneck
      ● Horizontal scaling (NoSQL)
          ○ Data is distributed evenly across the nodes (hashing)
          ○ More capacity? Just add one node
          ○ Loss of traditional database properties (ACID)
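The "distributed evenly across the nodes (hashing)" point can be sketched as follows. This is a minimal illustration, not how any particular NoSQL product does it: `hashKey`, `nodeFor` and `distribution` are hypothetical helper names, the FNV-1a hash is just a convenient stand-in, and the `user:<n>` keys are synthetic.

```javascript
// Hypothetical helper: 32-bit FNV-1a hash of a string key.
// Real stores use their own hash functions; this is only a stand-in.
function hashKey(key) {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value an unsigned 32-bit int
  }
  return h;
}

// Pick the node that owns a key: hash it, then take the remainder.
function nodeFor(key, nodeCount) {
  return hashKey(key) % nodeCount;
}

// Count how many of `keyCount` synthetic keys land on each node.
function distribution(nodeCount, keyCount) {
  const counts = new Array(nodeCount).fill(0);
  for (let i = 0; i < keyCount; i++) {
    counts[nodeFor("user:" + i, nodeCount)]++;
  }
  return counts;
}

console.log(distribution(4, 10000));
```

With plain modulo placement, changing the node count moves most keys to a different node, which is why many distributed stores use consistent hashing or pre-split ranges instead; the sketch above ignores that concern.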
  • 5. NoSQL definition
      ● Not only SQL (data is not exposed via a SQL query language)
      ● Non-relational (denormalized data)
      ● Distributed (horizontal partitioning)
      ● Different implementations:
          ○ Key-value store
          ○ Document database
          ○ Graph database
  • 6. Key-value stores
      ● Schema-less storage
      ● Basic associative arrays
          { "username" => "guillaume" }
      ● Key-value stores can have column families and subkeys
          { "user:name" => "guillaume", "user:uid" => 1000 }
      ● Implementations
          ○ K/V caches: Redis, memcache
              ■ In-memory databases
          ○ Column databases: Cassandra, HBase
              ■ Data is stored by column family (as opposed to by row in a traditional RDBMS)
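As a rough illustration of the associative-array model with "family:qualifier" subkeys, here is a small in-memory sketch. It uses a plain JavaScript `Map`; the `getFamily` helper is hypothetical and is not the API of Redis, Cassandra or HBase.

```javascript
// Minimal in-memory key-value store: keys are plain strings.
const store = new Map();
store.set("username", "guillaume");

// Subkeys written as "family:qualifier", as on the slide.
store.set("user:name", "guillaume");
store.set("user:uid", 1000);

// Hypothetical helper: collect every entry that belongs to one family,
// returning an object keyed by qualifier.
function getFamily(kv, family) {
  const prefix = family + ":";
  const result = {};
  for (const [key, value] of kv) {
    if (key.startsWith(prefix)) {
      result[key.slice(prefix.length)] = value;
    }
  }
  return result;
}

console.log(getFamily(store, "user"));
```

No schema is declared anywhere: adding a new qualifier is just another `set` call.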
  • 7. Document Databases
      ● Data is organized into documents:
          FirstName="Frank", City="Haifa", Hobby="Photographing"
      ● No strong typing or predefined fields; additional information can be added easily
          FirstName="Guillaume", Address="Hidalgo Village, Pasay City", Languages=[{Name:"French"}, {Name:"English"}, {Name:"Tagalog"}]
      ● An ensemble of documents is called a collection
      ● Uses structured standards: XML, JSON
      ● Implementations
          ○ CouchDB (Erlang)
          ○ MongoDB (C++)
          ○ RavenDB (.NET)
  • 8. Graph Databases
      ● Uses graph theory structures to represent information
      ● Typically used for relations
          ○ Example: followers/following in Twitter
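The followers/following example can be modelled as a directed graph. The sketch below keeps it in memory with a `Map` of `Set`s; the function names and the user names are made up for illustration, and a real graph database would add persistence, indexing and a query language on top.

```javascript
// Directed graph: each user maps to the set of users they follow.
const following = new Map();

// Add an edge "from follows to".
function follow(graph, from, to) {
  if (!graph.has(from)) graph.set(from, new Set());
  graph.get(from).add(to);
}

// Who does this user follow? (outgoing edges)
function followingOf(graph, user) {
  return [...(graph.get(user) || [])];
}

// Who follows this user? (incoming edges, found by scanning every node)
function followersOf(graph, user) {
  const result = [];
  for (const [from, targets] of graph) {
    if (targets.has(user)) result.push(from);
  }
  return result;
}

// Hypothetical sample data.
follow(following, "alice", "bob");
follow(following, "carol", "bob");
follow(following, "bob", "alice");
```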
  • 9. Databases at Toluna
      ● MySQL
          ○ Traditional Master-Slave configuration
          ○ Very efficient for small requests
          ○ Not good for analytics
          ○ "Big Data" issues (e.g. usersvotes)
      ● Microsoft SQL Server
          ○ Good all-around performance
          ○ Monolithic
              ■ Suffers from locking issues
              ■ Hard to scale (many connections)
          ○ Potentially complex SQL programming needed to get the best out of it
  • 10. Solutions? Let's evaluate some products...
  • 11. Apache HBase
      ● A column database based on the Hadoop architecture
      ● Commercially supported (Cloudera)
      ● Available on Red Hat, Debian
      ● Designed for very big data storage (terabytes)
      ● Users: Facebook, Yahoo!, Adobe, Mahalo, Twitter
    Pros
      ● Pure Java implementation
      ● Access to Hadoop MapReduce data via column storage
      ● True clustered architecture
    Cons
      ● Java
      ● Hard to deploy and maintain
      ● Limited options via the API (get, put, scans)
  • 12. Apache HBase: architecture
      ● Data is stored in cells
          ○ Primary row key
          ○ Column family
              ■ Limited in number
              ■ May have an unlimited number of qualifiers
          ○ Timestamp (version)
      ● Example cell structure
          RowId   Column Family:Qualifier   Timestamp    Value
          1000    user:name                 1312868789   guillaume
          1000    user:email                1312868789
  • 13. HBase Data operations
      ● Create a table and load some data
          create 'usersvotes', 'userid', 'date', 'meta'
          hadoop jar /usr/lib/hbase/hbase-0.90.3-cdh3u1.jar importtsv -Dimporttsv.columns=userid,HBASE_ROW_KEY,date,meta:answer,meta:country usersvotes ~/import/
      ● Retrieve data
          hbase(main):002:0> get 'usersvotes', '1071726'
          COLUMN         CELL
           date:         timestamp=1312780185940, value=1296523245
           meta:answer   timestamp=1312780185940, value=2
           meta:country  timestamp=1312780185940, value=ES
           userid:       timestamp=1312780185940, value=685352
          4 row(s) in 0.0720 seconds
      ● The most recent version (highest timestamp) of each cell is retrieved for the specified row key
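The idea that a cell is addressed by row key, column and timestamp, and that a get returns the newest version, can be sketched with a toy in-memory model. This is only an illustration of the concept: `VersionedTable` is a hypothetical class, not the HBase client API, and the second `put` uses a made-up value and timestamp.

```javascript
// Toy model of versioned cells: (rowKey, "family:qualifier", timestamp) -> value.
class VersionedTable {
  constructor() {
    this.cells = [];
  }

  // Store one version of one cell; older versions are kept.
  put(rowKey, column, value, timestamp) {
    this.cells.push({ rowKey, column, value, timestamp });
  }

  // For each column of the row, return the value with the highest timestamp.
  get(rowKey) {
    const latest = {};
    for (const cell of this.cells) {
      if (cell.rowKey !== rowKey) continue;
      const seen = latest[cell.column];
      if (!seen || cell.timestamp > seen.timestamp) latest[cell.column] = cell;
    }
    const row = {};
    for (const column of Object.keys(latest)) row[column] = latest[column].value;
    return row;
  }
}

const table = new VersionedTable();
table.put("1000", "user:name", "guillaume", 1312868789);
// Hypothetical later write to the same cell: a newer version, not an overwrite.
table.put("1000", "user:name", "guillaume2", 1312868790);
console.log(table.get("1000"));
```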
  • 14. HBase Data operations: API
      ● If not using Java, HBase must be queried through a web service (XML, JSON or protobuf)
      ● Types of operations
          ○ Get (read single value)
          ○ Put (write single value)
          ○ Get multi (read multiple versions)
          ○ Scan (retrieve multiple rows via a scan)
      ● MapReduce jobs can be run against the database using Java code
  • 15. What is MapReduce?
      "Map" step: The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes that smaller problem, and passes the answer back to its master node.
      "Reduce" step: The master node then takes the answers to all the sub-problems and combines them in some way to get the output: the answer to the problem it was originally trying to solve.
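The two steps can be sketched on a single machine. This is a simplified model, assuming the "workers" are just function calls in a loop; `mapReduce`, `chunkSize`, `workerFn` and `combineFn` are hypothetical names, and a real framework would run the sub-problems on separate nodes.

```javascript
// Split the input into sub-problems, solve each one, then combine the answers.
function mapReduce(input, chunkSize, workerFn, combineFn) {
  // "Map" step: partition the input and hand each chunk to a worker.
  const partialAnswers = [];
  for (let i = 0; i < input.length; i += chunkSize) {
    const chunk = input.slice(i, i + chunkSize);
    partialAnswers.push(workerFn(chunk)); // a worker node would run this
  }
  // "Reduce" step: the master combines the partial answers.
  return combineFn(partialAnswers);
}

// Example problem: sum a list of numbers.
const sum = (numbers) => numbers.reduce((a, b) => a + b, 0);
const total = mapReduce([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3, sum, sum);
console.log(total); // 55
```

Summing works here because partial sums can themselves be summed; the combine function has to be chosen so that combining partial answers gives the same result as processing the whole input at once.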
  • 16. MongoDB
      ● A document-oriented database
      ● Written in C++
      ● Uses JavaScript for its shell and for user-defined functions
      ● Commercial support (10gen)
      ● Large user base (NY Times, Disney, MTV, foursquare)
    Pros
      ● Easy installation and deployment
      ● Sharding and replication
      ● Easy API (JavaScript), multiple languages
      ● Similarities to MySQL (indexes, queries)
    Cons
      ● Versions < 1.8.0 had many issues (development not mature): consistency, crashes...
  • 17. MongoDB data structure
      ● No predefinition of fields; all operations are implicit
          # Create document
          > d = { "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3, "country" : "GB" };
          # Save it into a new collection
          >;
          # Retrieve documents from the collection
          > db.usersvotes.find();
          { "_id" : ObjectId("4e3bde5ae84838f87bf883b2"), "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493), "answer" : 3, "country" : "GB" }
  • 18. MongoDB Indexes and Queries
      Let's create an index...
          db.usersvotes.ensureIndex({pollid: 1, date: -1})
      ● Indexes can be created in the background
      ● Index keys are sortable
      ● Queries without indexes are slow (scans)
      Let's get a usersvotes stream
          db.usersvotes.find({pollid: 676781}).sort({date: -1}).skip(10).limit(10);
          # equivalent to the following SQL: SELECT * FROM usersvotes WHERE pollid = 676781 ORDER BY date DESC LIMIT 10,10;
          { "_id" : ObjectId("4e3bde5ce84838f87bfa31e6"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong(1295077466), "answer" : 1, "country" : "GB" }
          { "_id" : ObjectId("4e3bde5ce84838f87bfa31e7"), "userid" : 8130783, "pollid" : 676781, "date" : NumberLong(1295077466), "answer" : 5, "country" : "GB" }
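To make the find / sort / skip / limit chain concrete, here is what it computes, written against a plain JavaScript array instead of a MongoDB collection. The `votes` data is synthetic and `voteStream` is a hypothetical helper; the sketch mirrors the semantics of the query only and uses no index, so it is exactly the kind of full scan the slide warns about.

```javascript
// Synthetic documents shaped like the usersvotes collection.
const votes = [
  { userid: 1, pollid: 676781, date: 100, answer: 1 },
  { userid: 2, pollid: 676781, date: 300, answer: 5 },
  { userid: 3, pollid: 53064, date: 200, answer: 3 },
  { userid: 4, pollid: 676781, date: 200, answer: 2 },
  { userid: 5, pollid: 676781, date: 400, answer: 1 },
];

// Same result as: find({pollid}).sort({date: -1}).skip(skip).limit(limit)
function voteStream(docs, pollid, skip, limit) {
  return docs
    .filter((d) => d.pollid === pollid) // WHERE pollid = ?
    .sort((a, b) => b.date - a.date)    // ORDER BY date DESC
    .slice(skip, skip + limit);         // LIMIT skip, limit
}

console.log(voteStream(votes, 676781, 1, 2));
```

`filter` returns a new array, so the in-place `sort` does not reorder `votes` itself.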
  • 19. MongoDB MapReduce
      ● Uses JavaScript functions
      ● Single-threaded (JavaScript limitation)
      ● Parallelized (runs across all shards)
          # Aggregate data over the country field
          m = function() { emit (, { count: 1 } ) };
          # Count items in each country
          r = function(k, vals) { var result = { count: 0 }; vals.forEach(function(value) { result.count += value.count; }); return result; }
          # Start the job
          res = db.usersvotes.mapReduce(m, r, { out: { inline : 1 } });
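The counting job can be emulated outside MongoDB to see what the two functions do. The sketch below assumes the key passed to emit is the document's country field, as the slide's comment suggests; `runMapReduce` is a hypothetical driver, `emit` is passed in as an argument rather than being a global as in the mongo shell, and the documents are synthetic.

```javascript
// Hypothetical driver: call mapFn on every document, group emitted values
// by key, then call reduceFn once per key.
function runMapReduce(docs, mapFn, reduceFn) {
  const groups = new Map();
  const emit = (key, value) => {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  };
  for (const doc of docs) mapFn.call(doc, emit); // `this` is the document
  const results = {};
  for (const [key, values] of groups) results[key] = reduceFn(key, values);
  return results;
}

// Map: emit one { count: 1 } per document, keyed by country (assumed key).
const mapByCountry = function (emit) {
  emit(this.country, { count: 1 });
};

// Reduce: add up the counts emitted for one key.
const sumCounts = function (key, vals) {
  const result = { count: 0 };
  vals.forEach((value) => { result.count += value.count; });
  return result;
};

// Synthetic documents.
const sampleVotes = [
  { userid: 1, country: "GB" },
  { userid: 2, country: "ES" },
  { userid: 3, country: "GB" },
];
console.log(runMapReduce(sampleVotes, mapByCountry, sumCounts));
```

The reduce function returns the same shape it receives (`{ count: n }`), so its output can be fed back into another reduce call, which is what allows partial results from several shards to be merged.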
  • 20. VoltDB: a SQL alternative
      ● "NewSQL"
      ● In-memory database
    Pros
      ● Lock-free
      ● ACID properties
      ● Linear scalability
      ● Fault tolerant
      ● Java implementation
    Cons
      ● Cannot query the database directly; Java stored procedures only
      ● Database shutdown needed to modify the schema
      ● Database shutdown needed to add cluster nodes
      ● No more memory = no more storage
  • 21. What about MySQL? Potential solutions
  • 22. MySQL: some interesting facts
      ● In single-node tests, MySQL was always faster than the NoSQL solutions
      ● Loading data was faster
          ○ Sample usersvotes data (1 GB TSV file)
              ■ MySQL: 20 seconds
              ■ MongoDB: >10 minutes
              ■ HBase: >30 minutes
      ● Proven technology
  • 23. MySQL Analytics
      ● MySQL might be outperformed by other solutions in analytics, depending on the data size
      ● Several column database solutions exist for MySQL (Infobright, ICE, Tokutek)
      ● Word count operations can be offloaded to a full-text search engine (Sphinx, Solr, Lucene)
  • 24. MySQL Big Data
      ● The Vote Stream case
      ● Simple query
          explain select * from toluna_polls.usersvotes where pollid=843206 order by votedate desc limit 10,20
          1, SIMPLE, usersvotes, ref, Index_POLLID, Index_POLLID, 8, const, 2556, Using where; Using filesort
      ● Can be easily solved with a composite index on (pollid, votedate), which removes the filesort
          ALTER TABLE usersvotes ADD KEY (pollid, votedate)
      ● But!
          ○ usersvotes = 160 GB of datafiles
          ○ Adding the index is an offline operation that would take hours
          ○ An online schema change could be used, but might run out of space and/or take days
  • 25. Conclusions
      ● HBase: a good choice for analytics, but not well suited to traditional database operations
          ○ Most companies use HBase/Hadoop to offload analytical data from their main database
          ○ Java experience needed (which Toluna has, IMHO)
          ○ IT must be trained
      ● MongoDB
          ○ Very good choice for starting new web applications "from the ground up"
      ● VoltDB
          ○ Great technology, but lacks flexibility
      ● Traditional databases
          ○ Will probably be around for a long time
  • 26. Questions?