NoSQL Solutions - a comparative study
Transcript

  • 1. NoSQL solutions: a comparative study
  • 2. Brief history of databases
    ● 1980: Oracle DBMS released
    ● 1995: first release of MySQL, a lightweight, asynchronously replicated database
    ● 2000: MySQL 3.23 is adopted by most startups as a database platform
    ● 2004: Google develops BigTable
    ● 2009: the term NoSQL is coined
  • 3. Evolution of database storage
    ● Monolithic: using traditional databases (Oracle, Sybase...)
      ○ High cost in infrastructure (big CPU, big SAN)
      ○ High cost in software licenses (commercial software)
      ○ Qualified personnel required (certified DBAs)
    ● LAMP platform (circa 2000)
      ○ Free software
      ○ Runs on commodity hardware
      ○ Low administration (you don't need a DBA until your data grows large)
    ● NoSQL
      ○ Scales indefinitely (replication only scales vertically)
      ○ Logic is in the application
      ○ Doesn't require SQL or database-internals knowledge
  • 4. Scaling
    ● Vertical scaling (MySQL)
      ○ Scaling by adding more replicas
      ○ Load is evenly distributed across the replicas
      ○ Problems!
        ■ Cached data is also distributed evenly: inefficient resource usage
        ■ Efficiency drops once the dataset exceeds available memory; adding replicas no longer improves performance
        ■ Write bottleneck
    ● Horizontal scaling (NoSQL)
      ○ Data is distributed evenly across the nodes (hashing)
      ○ More capacity? Just add one node
      ○ Loss of traditional database properties (ACID)
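The hash-based distribution mentioned above can be sketched in a few lines. This is an illustrative toy, not taken from the slides: the hash function, node count, and key names are assumptions, and real stores use stronger hashes and consistent hashing rather than plain modulo placement.

```javascript
// Sketch: place each key on one of N nodes by hashing it (illustrative only).
function hashKey(key) {
  // Simple deterministic djb2-style string hash; real systems use stronger hashes.
  let h = 5381;
  for (const ch of key) h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  return h;
}

function nodeFor(key, nodeCount) {
  // Modulo placement: adding a node remaps most keys, which is why
  // production systems prefer consistent hashing.
  return hashKey(key) % nodeCount;
}

// The same key always lands on the same node, so reads and writes
// for that key go straight to one machine:
const nodes = 4;
const target = nodeFor("user:1000", nodes);
console.log(target);
```

Adding capacity means adding a node and letting the hash spread keys over it, which is the "just add one node" property, at the cost of the rebalancing noted above.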
  • 5. NoSQL definition
    ● Not only SQL (not exposed via a query language)
    ● Non-relational (denormalized data)
    ● Distributed (horizontal partitioning)
    ● Different implementations:
      ○ Key-value store
      ○ Document database
      ○ Graph database
  • 6. Key-value stores
    ● Schema-less storage
    ● Basic associative arrays
        { "username" => "guillaume" }
    ● Key-value stores can have column families and subkeys
        { "user:name" => "guillaume", "user:uid" => 1000 }
    ● Implementations
      ○ K/V caches: Redis, memcached
        ■ in-memory databases
      ○ Column databases: Cassandra, HBase
        ■ Data is stored in a columnar fashion (as opposed to rows in a traditional RDBMS)
  • 7. Document databases
    ● Data is organized into documents:
        FirstName="Frank", City="Haifa", Hobby="Photographing"
    ● No strong typing or predefined fields; additional information can be added easily:
        FirstName="Guillaume", Address="Hidalgo Village, Pasay City",
        Languages=[{Name:"French"}, {Name:"English"}, {Name:"Tagalog"}]
    ● An ensemble of documents is called a collection
    ● Uses structured standards: XML, JSON
    ● Implementations
      ○ CouchDB (Erlang)
      ○ MongoDB (C++)
      ○ RavenDB (.NET)
  • 8. Graph databases
    ● Uses graph theory structures to represent information
    ● Typically used for relations
      ○ Example: followers/following on Twitter
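The follower/following example above boils down to a directed graph stored as adjacency lists. A minimal in-memory sketch, with made-up user names, assuming nothing about any particular graph database's API (a real graph store persists and indexes these edges natively):

```javascript
// Sketch: the "follows" relation as a directed graph (adjacency lists).
const follows = new Map(); // user -> Set of users they follow

function follow(a, b) {
  // Add a directed edge a -> b.
  if (!follows.has(a)) follows.set(a, new Set());
  follows.get(a).add(b);
}

follow("alice", "bob");
follow("bob", "carol");
follow("alice", "carol");

// "Who follows carol?" is a reverse traversal over the edges:
const followersOfCarol = [...follows.entries()]
  .filter(([, targets]) => targets.has("carol"))
  .map(([user]) => user);
console.log(followersOfCarol); // ["alice", "bob"]
```

In a relational database this reverse lookup is a join (or a second index); a graph database answers it by walking incoming edges directly.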
  • 9. Databases at Toluna
    ● MySQL
      ○ Traditional master-slave configuration
      ○ Very efficient for small requests
      ○ Not good for analytics
      ○ "Big Data" issues (e.g. the usersvotes table)
    ● Microsoft SQL Server
      ○ Good all-around performance
      ○ Monolithic
        ■ Suffers from locking issues
        ■ Hard to scale (many connections)
      ○ Potentially complex SQL programming to get the best out of it
  • 10. Solutions? Let's evaluate some products...
  • 11. Apache HBase
    ● A column database based on the Hadoop architecture
    ● Commercially supported (Cloudera)
    ● Available on Red Hat, Debian
    ● Designed for very big data storage (terabytes)
    ● Users: Facebook, Yahoo!, Adobe, Mahalo, Twitter
    Pros
    ● Pure Java implementation
    ● Access to Hadoop MapReduce data via column storage
    ● True clustered architecture
    Cons
    ● Java
    ● Hard to deploy and maintain
    ● Limited options via the API (get, put, scans)
  • 12. Apache HBase: architecture
    ● Data is stored in cells
      ○ Primary row key
      ○ Column family
        ■ Limited in number
        ■ May have indefinite qualifiers
      ○ Timestamp (version)
    ● Example cell structure:
        RowId  Column Family:Qualifier  Timestamp   Value
        1000   user:name                1312868789  guillaume
        1000   user:email               1312868789  g@dragonscale.eu
  • 13. HBase data operations
    ● Create a table and import some data:
        create 'usersvotes', 'userid', 'date', 'meta'
        hadoop jar /usr/lib/hbase/hbase-0.90.3-cdh3u1.jar importtsv \
          -Dimporttsv.columns=userid,HBASE_ROW_KEY,date,meta:answer,meta:country \
          usersvotes ~/import/
    ● Retrieve data:
        hbase(main):002:0> get 'usersvotes', '1071726'
        COLUMN        CELL
         date:        timestamp=1312780185940, value=1296523245
         meta:answer  timestamp=1312780185940, value=2
         meta:country timestamp=1312780185940, value=ES
         userid:      timestamp=1312780185940, value=6853524
        4 row(s) in 0.0720 seconds
    ● The latest version (highest timestamp) of each cell for the specified row key is retrieved
  • 14. HBase data operations: API
    ● If not using Java, HBase must be queried through a web service (XML, JSON or protobuf)
    ● Types of operations:
      ○ Get (read a single value)
      ○ Put (write a single value)
      ○ Get multi (read multiple versions)
      ○ Scan (retrieve multiple rows via a scan)
    ● MapReduce jobs can be run against the database using Java code
  • 15. What is MapReduce?
    "Map" step: the master node takes the input, partitions it into smaller sub-problems, and distributes those to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. Each worker node processes its smaller problem and passes the answer back to its master node.
    "Reduce" step: the master node then takes the answers to all the sub-problems and combines them in some way to get the output: the answer to the problem it was originally trying to solve.
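The two steps above can be sketched on the classic word-count example. This runs in a single process purely to show the data flow; the splitting of input, the function names, and the sample text are illustrative assumptions, and a real framework (Hadoop, or MongoDB's mapReduce on the next slides) distributes the map and reduce work across nodes.

```javascript
// Sketch of the two MapReduce steps (single process, word-count example).
function mapStep(doc) {
  // "Map": break the input into sub-problems, emitting (word, 1) pairs.
  return doc.split(/\s+/).filter(Boolean).map(word => [word, 1]);
}

function reduceStep(pairs) {
  // "Reduce": combine all values emitted for the same key into a final count.
  const counts = new Map();
  for (const [word, n] of pairs) counts.set(word, (counts.get(word) || 0) + n);
  return counts;
}

const docs = ["to be or not to be"];
const counts = reduceStep(docs.flatMap(mapStep));
console.log(counts.get("to")); // 2
```

The same shape (emit key/value pairs, then fold per key) reappears almost verbatim in the MongoDB MapReduce slide below.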
  • 16. MongoDB
    ● A document-oriented database
    ● Written in C++
    ● Uses JavaScript as its functional specification
    ● Commercial support (10gen)
    ● Large user base (NY Times, Disney, MTV, foursquare)
    Pros
    ● Easy installation and deployment
    ● Sharding and replication
    ● Easy API (JavaScript), multiple language drivers
    ● Similarities to MySQL (indexes, queries)
    Cons
    ● Versions < 1.8.0 had many issues (development not mature): consistency, crashes...
  • 17. MongoDB data structure
    ● No predefinition of fields; all operations are implicit
        # Create a document
        > d = { "userid" : 8173095, "pollid" : 53064, "date" : NumberLong(1293874493),
                "answer" : 3, "country" : "GB" };
        # Save it into a new collection
        > db.usersvotes.save(d);
        # Retrieve documents from the collection
        > db.usersvotes.find();
        { "_id" : ObjectId("4e3bde5ae84838f87bf883b2"), "userid" : 8173095, "pollid" : 53064,
          "date" : NumberLong(1293874493), "answer" : 3, "country" : "GB" }
  • 18. MongoDB indexes and queries
    Let's create an index...
        db.usersvotes.ensureIndex({pollid: 1, date: -1})
    ● Indexes can be created in the background
    ● Index keys are sortable
    ● Queries without indexes are slow (scans)
    Let's get a usersvotes stream:
        db.usersvotes.find({pollid: 676781}).sort({date: -1}).skip(10).limit(10);
        # equivalent to the following SQL:
        # SELECT * FROM usersvotes WHERE pollid = 676781 ORDER BY date DESC LIMIT 10,10;
        { "_id" : ObjectId("4e3bde5ce84838f87bfa31e6"), "userid" : 8130783, "pollid" : 676781,
          "date" : NumberLong(1295077466), "answer" : 1, "country" : "GB" }
        { "_id" : ObjectId("4e3bde5ce84838f87bfa31e7"), "userid" : 8130783, "pollid" : 676781,
          "date" : NumberLong(1295077466), "answer" : 5, "country" : "GB" }
  • 19. MongoDB MapReduce
    ● Uses JavaScript functions
    ● Single-threaded (JavaScript limitation)
    ● Parallelized (runs across all shards)
        # Aggregate data over the country field
        m = function() { emit(this.country, { count: 1 }); };
        # Count items in each country
        r = function(k, vals) {
          var result = { count: 0 };
          vals.forEach(function(value) { result.count += value.count; });
          return result;
        };
        # Start the job
        res = db.usersvotes.mapReduce(m, r, { out: { inline: 1 } });
  • 20. VoltDB: a SQL alternative
    ● "NewSQL"
    ● In-memory database
    Pros
    ● Lock-free
    ● ACID properties
    ● Linear scalability
    ● Fault tolerant
    ● Java implementation
    Cons
    ● Cannot query the database directly; Java stored procedures only
    ● Database shutdown needed to modify the schema
    ● Database shutdown needed to add cluster nodes
    ● No more memory = no more storage
  • 21. What about MySQL? Potential solutions
  • 22. MySQL: some interesting facts
    ● In single-node tests, MySQL was always faster than the NoSQL solutions
    ● Loading data was faster
      ○ Sample usersvotes data (1 GB TSV file)
        ■ MySQL: 20 seconds
        ■ MongoDB: >10 minutes
        ■ HBase: >30 minutes
    ● Proven technology
  • 23. MySQL analytics
    ● MySQL might be outperformed by other solutions in analytics, depending on the data size
    ● Several column database solutions exist for MySQL (Infobright, ICE, Tokutek)
    ● Word-count operations can be offloaded to a full-text search engine (Sphinx, Solr, Lucene)
  • 24. MySQL Big Data
    ● The vote stream case
    ● Simple query:
        explain select * from toluna_polls.usersvotes where pollid=843206
        order by votedate desc limit 10,20
        1, SIMPLE, usersvotes, ref, Index_POLLID, Index_POLLID, 8, const, 2556,
        Using where; Using filesort
    ● Can be easily solved by a covering index:
        ALTER TABLE usersvotes ADD KEY (pollid, votedate)
    ● But!
      ○ usersvotes = 160 GB of datafiles
      ○ Adding the index is an offline operation and would take hours
      ○ An online schema change could be used, but might run out of space and/or take days
  • 25. Conclusions
    ● HBase: a good choice for analytics, but not well suited to traditional database operations
      ○ Most companies use HBase/Hadoop to offload analytical data from their main database
      ○ Java experience needed (which Toluna has, imho)
      ○ IT must be trained
    ● MongoDB
      ○ A very good choice for starting new web applications "from the ground up"
    ● VoltDB
      ○ Great technology, but lacks flexibility
    ● Traditional databases
      ○ Will probably be around for a long time
  • 26. Questions?