NoSQL - Life Beyond the Outer Join

This talk was from the Canberra Java User Group in July 2010. You can download the source to go with these slides from http://bitbucket.org/glen_a_smith/cjug-nosql-examples.

The NoSQL (Not Only SQL) movement has been gaining a lot of press over the last year as a means of scaling massive data storage, complex relationships and lightning-fast retrieval for the Web's biggest sites. This month we're taking a trip to the big end of town and looking at some of the backend technologies that are powering sites like Twitter, Facebook, LinkedIn, Reddit, Digg and Google.

We'll be looking at popular Java clients and servers that play in the NoSQL space and have a brief survey of the following popular NoSQL platforms: Document Databases (CouchDB), Sophisticated Key/Value Stores (Voldemort), Graph Databases (Neo4j), and simple Key/Value stores (Memcached). It'll be a lightning tour of what each technology offers, some source code on how it works, and lots of headshifts about how to store data such that you don't ever need another Left Outer Join!

Usage rights: CC Attribution License

NoSQL - Life Beyond the Outer Join Presentation Transcript

  • 1. NoSQL - Life beyond the Outer Join
      • Glen Smith (glen@bytecode.com.au)
  • 2. Objectives
      • Survey the landscape of NoSQL offerings
      • Learn some of the terminology
      • Look at some of the Java offerings in the space
      • Take away source to play with
      • Be able to ask questions (but you may not get answers)
  • 3. What is NoSQL?
      • (N)ot (O)nly SQL, not “Anti SQL”
      • A movement more than one technology
      • Distributed storage systems
      • Much weaker queries
      • Scale across many machines
      • Much larger data, much faster queries
  • 4. Why NoSQL?
      • Inspired by distributed data storage problems
      • Scale easily by adding servers
      • Not suited to all problem types, but super-suited to certain large problem types
      • High-write situations (e.g. activity tracking or timeline rendering for millions of users)
      • A lot of relational uses are really dumbed down (e.g. fetch by PK with update)
  • 5. What’s wrong with RDBMS?
      • Nothing ;-)
      • To scale an RDBMS, your approach is typically:
          • Shard your datasource
          • Put in a bunch of read replicas
          • Put memcached in front of those
      • What could possibly go wrong?
      • Complex. Custom caching. Partitioning. Migrating of shards. Tons of moving parts.
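
To make the "migrating of shards" pain on slide 5 concrete, here is a minimal sketch that assumes the simplest possible routing rule, primary key modulo shard count. The slides do not say which sharding scheme is meant, so the class and method names are purely illustrative. It counts how many keys land on a different shard when one shard is added.

```java
// Hypothetical illustration: modulo-based shard routing. This scheme is an
// assumption for the example, not something named in the slides.
class ModuloSharding {

    // Route a numeric primary key to one of shardCount shards.
    static int shardFor(long key, int shardCount) {
        return (int) Math.floorMod(key, (long) shardCount);
    }

    // Count how many of the keys 0..keyCount-1 change shard when the
    // cluster is resized from oldShards to newShards.
    static int countMoved(int keyCount, int oldShards, int newShards) {
        int moved = 0;
        for (long key = 0; key < keyCount; key++) {
            if (shardFor(key, oldShards) != shardFor(key, newShards)) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        int moved = countMoved(1000, 4, 5);
        System.out.println(moved + " of 1000 keys move when growing from 4 to 5 shards");
    }
}
```

With this naive rule, growing from 4 to 5 shards relocates 800 of 1000 sequential keys, which is why resharding tends to need custom migration tooling.
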
  • 6. How can I live w/o ACID?
      • Atomic (it happens or not, no partial completes)
      • Consistent (DB internals, referential integrity, field validation)
      • Isolated (can’t modify uncommitted data)
      • Durable (written to disk/transaction log)
      • But in a distributed DB, life is not so simple...
  • 7. The CAP theorem
      • In a distributed system, when you have state on more than one machine, pick any two:
          • Consistency (easy in read-only states: copy!)
          • Availability (can you get at your data? Is it up?)
          • Partition tolerance (3 machines on one network, 3 on the other, with a broken link. How do you take updates when you can’t keep everyone up to date? What if you don’t agree on what’s up?)
  • 8. How do these NoSQL things work?
      • Basically big distributed hashtables
      • Push all logic into the write (update two lists: one for userId, one for email)
      • Things don’t happen transactionally. These are two writes.
      • There is no free lunch. The programmer is now handling consistency problems.
      • You were thinking about query optimisation before, and now even more so.
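
The "push all logic into the write" idea on slide 8 can be sketched with two plain in-memory maps standing in for two key/value buckets. The class and method names are hypothetical and nothing here is a real NoSQL client API. The point is only that each lookup path gets its own write, and that the two writes are not wrapped in a transaction.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: two key/value "buckets" kept in step by the writer.
class UserDirectory {
    private final Map<String, String> nameByUserId = new HashMap<>();
    private final Map<String, String> userIdByEmail = new HashMap<>();

    // All the query planning happens here, at write time: one write per
    // lookup path. These are two separate writes with no transaction, so a
    // failure between them would leave the two buckets out of step.
    void save(String userId, String email, String name) {
        nameByUserId.put(userId, name);
        userIdByEmail.put(email, userId);
    }

    // Lookup by primary key: a single get.
    String findNameById(String userId) {
        return nameByUserId.get(userId);
    }

    // Lookup by email: follow the secondary bucket, then the primary one.
    String findNameByEmail(String email) {
        String userId = userIdByEmail.get(email);
        return userId == null ? null : nameByUserId.get(userId);
    }

    public static void main(String[] args) {
        UserDirectory directory = new UserDirectory();
        directory.save("u1", "glen@example.com", "Glen");
        System.out.println(directory.findNameByEmail("glen@example.com"));
    }
}
```
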
  • 9. How big are we talking?
      • Digg: 3 TB
      • Facebook Inbox: 50 TB
      • eBay: 2 PB
      • Think about Twitter’s issues... Billions of queries a second over terabytes of data.
  • 10. The NoSQL Taxonomy
      • Key-Value In-Memory stores (Memcached, Redis)
      • Key-Value “Eventually Consistent” stores (“Dynamo clones” like Cassandra, Voldemort, Riak)
      • Document stores (CouchDB, MongoDB, JCR)
      • Graph Databases (Neo4j)
      • Tabular (“BigTable clones” like Hadoop/HBase)
  • 11. Memcached
      • Developed for the original LiveJournal site
      • LRU, distributed hashtable
      • Logic is in both client and server
      • Used in Google App Engine, Facebook, Twitter
      • Ehcache now has a similar service
      • Good for things that outlive an app server
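
The "LRU" in slide 11 refers to least-recently-used eviction: when the cache is full, the entry that has gone longest without being read or written is dropped. The sketch below shows the idea with the JDK's `LinkedHashMap` in access order. It illustrates the eviction policy only and makes no claim about how memcached implements it internally; the class name `LruCache` is made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch built on LinkedHashMap's access-order mode.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        // Third argument true = iterate from least to most recently accessed.
        super(16, 0.75f, true);
        this.capacity = capacity;
    }

    // Called by LinkedHashMap after each put; returning true evicts the
    // least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("a", "1");
        cache.put("b", "2");
        cache.get("a");      // touch "a" so that "b" becomes the eldest entry
        cache.put("c", "3"); // exceeds capacity, so "b" is evicted
        System.out.println(cache.keySet());
    }
}
```
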
  • 12. How does it work?
      • Clients know how to:
          • Send items to servers (consistent hashing)
          • Decide what to do when a server fails
          • Fetch keys from servers
          • Weight servers by capacity
      • Servers know how to:
          • Store items they receive
          • Expire them from the cache
      • No inter-server comms: every server is unaware of the others
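
Slide 12 puts the routing logic in the client via consistent hashing. A minimal sketch of one common way to do this follows: each server is placed at several points on a hash ring, and a key goes to the first server point at or after the key's own hash. The class name, the MD5-based hash, and the points-per-server parameter are assumptions made for the example; real memcached clients differ in the details.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical client-side hash ring, not the API of any real client.
class HashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int pointsPerServer;

    HashRing(List<String> servers, int pointsPerServer) {
        this.pointsPerServer = pointsPerServer;
        for (String server : servers) {
            add(server);
        }
    }

    // Place the server at several pseudo-random points on the ring.
    void add(String server) {
        for (int i = 0; i < pointsPerServer; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    // Take a failed server's points off the ring.
    void remove(String server) {
        for (int i = 0; i < pointsPerServer; i++) {
            ring.remove(hash(server + "#" + i));
        }
    }

    // Walk clockwise from the key's hash to the next server point,
    // wrapping around to the first point if necessary.
    String serverFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no servers on the ring");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of the MD5 digest, packed into a long.
    static long hash(String value) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xff);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

When a server is removed, only the keys that lived on it are reassigned; keys on the surviving servers stay where they were. Giving a larger server more points on the ring is one plausible way to weight by capacity.
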
  • 13. Sample Code
  • 14. Voldemort
      • Less than Memcached, but also more!
      • Not a cache, but a distributed key/value store
      • Developed by LinkedIn
      • Works on a distributed hashmap with failover
      • Logic can be in client/server or just server
      • Pluggable storage (MySQL, BDB, mock)
      • Pluggable serialization (JSON, Google Protocol Buffers, etc.)
  • 15. “Relaxed” Consistency
      • Eventual consistency: data will come into sync, but not immediately on the write. In practice “pretty soon” is milliseconds later.
      • We are actually used to this, e.g. Google indexes update every so often.
      • Guarantees to read your own writes (e.g. your profile on LinkedIn)
      • Tuneable to better performance/weaker consistency
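
The "tuneable" trade-off on slide 15 is commonly expressed in Dynamo-style stores with three numbers: how many replicas hold each key (N), how many must answer a read (R), and how many must acknowledge a write (W). The sketch below assumes that model. The class and field names are illustrative and are not Voldemort's actual configuration keys. When R + W > N, every read set overlaps every write set in at least one replica; smaller values trade that overlap for speed.

```java
// Hypothetical sketch of Dynamo-style quorum settings.
class QuorumSettings {
    final int replicas;       // N: copies kept of each key
    final int requiredReads;  // R: replicas that must answer a read
    final int requiredWrites; // W: replicas that must acknowledge a write

    QuorumSettings(int replicas, int requiredReads, int requiredWrites) {
        if (requiredReads < 1 || requiredWrites < 1
                || requiredReads > replicas || requiredWrites > replicas) {
            throw new IllegalArgumentException("need 1 <= R <= N and 1 <= W <= N");
        }
        this.replicas = replicas;
        this.requiredReads = requiredReads;
        this.requiredWrites = requiredWrites;
    }

    // True when any R replicas must share at least one member with any W
    // replicas, so a read always touches a replica that saw the last write.
    boolean readsOverlapWrites() {
        return requiredReads + requiredWrites > replicas;
    }

    public static void main(String[] args) {
        System.out.println("N=3 R=2 W=2 overlap: " + new QuorumSettings(3, 2, 2).readsOverlapWrites());
        System.out.println("N=3 R=1 W=1 overlap: " + new QuorumSettings(3, 1, 1).readsOverlapWrites());
    }
}
```
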
  • 16. What’s attractive?
      • Data is automatically replicated
      • Partitioning ensures each server holds only a subset of the data
      • Server failure is handled transparently
      • Data is rebalanced when servers are added/removed
      • Serialization is pluggable
      • Apache License
  • 17. Impressive Performance
      • “We were able to move applications that needed to handle hundreds of millions of reads and writes per day from over 400ms to under 10ms while simultaneously increasing the amount of data we store.”
  • 18. Performance Info http://www.slideshare.net/bhupeshbansal/hadoop-user-group-jan2010
  • 19. Sample Script
      • Starting the server (or deploy as a .war)
          bin\voldemort-server.bat config\single_node_cluster
      • Starting the console
          bin\voldemort-shell.bat test tcp://localhost:6666
      • Run some queries
          put "hello" "world"
          get "hello"
          put "hello" "world 2.0"
          delete "hello"
  • 20. Sample Code
  • 21. CouchDB
      • Document-oriented DB, no schema
      • Written in Erlang (!) by a Notes dev (!!!)
      • Everything is stored in JSON, RESTful API
      • Clever replication concepts: works in disconnected settings
      • Every write creates a new version of the document
      • Map/Reduce baked in
      • Apache License
  • 22. What’s attractive?
      • Schemaless operation: ad hoc data
      • Incremental replication (great for disconnected settings)
      • Great fault-tolerance (with versioned conflicts)
      • Fast query with flexibility (MapReduce)
  • 23. So what is this Map/Reduce thing?
      • Popularized by Google’s MapReduce framework
      • Map functions collect documents matching criteria and create a B-Tree
      • Reduce functions operate on the B-Tree
      • Everything happens in parallel on many machines
      • Example: distributed grep
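
The "distributed grep" example from slide 23 can be sketched on a single JVM, with a parallel stream standing in for the many machines. This is an illustration of the map and reduce roles only, not CouchDB's or Hadoop's API, and the class and method names are made up. The map step emits a document's name once for every line that matches the pattern; the reduce step counts the emitted names per document.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical single-process sketch of a map/reduce grep.
class GrepMapReduce {

    static Map<String, Long> countMatches(Map<String, List<String>> docs, String regex) {
        Pattern pattern = Pattern.compile(regex);
        return docs.entrySet().parallelStream()
                // Map: emit the document name once per matching line.
                .flatMap(doc -> doc.getValue().stream()
                        .filter(line -> pattern.matcher(line).find())
                        .map(line -> doc.getKey()))
                // Reduce: count the emitted names per document, sorted by name.
                .collect(Collectors.groupingBy(name -> name, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, List<String>> docs = new HashMap<>();
        docs.put("a.log", Arrays.asList("error: disk full", "ok", "error: timeout"));
        docs.put("b.log", Arrays.asList("ok", "ok"));
        System.out.println(countMatches(docs, "error"));
    }
}
```

Documents with no matching lines emit nothing in the map step, so they do not appear in the result at all.
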
  • 24. The Naked Couch
      • http://127.0.0.1:5984/
      • http://127.0.0.1:5984/_all_dbs
      • http://127.0.0.1:5984/mydb (PUT)
      • http://127.0.0.1:5984/_utils/ (Futon)
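
The URLs on slide 24 are plain HTTP calls, so any HTTP client can drive them. The sketch below uses the JDK's own `java.net.http` client (Java 11 or later), which postdates the talk, to build the three requests. The class and method names are illustrative. By default it only prints the requests; passing `--send` assumes a CouchDB instance is listening on 127.0.0.1:5984 and that it accepts unauthenticated requests, which newer CouchDB releases may not.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper that builds the requests listed on the slide.
class NakedCouch {
    static final String BASE = "http://127.0.0.1:5984";

    // GET / returns the server's welcome document.
    static HttpRequest serverInfo() {
        return HttpRequest.newBuilder(URI.create(BASE + "/")).GET().build();
    }

    // GET /_all_dbs lists the databases.
    static HttpRequest listDatabases() {
        return HttpRequest.newBuilder(URI.create(BASE + "/_all_dbs")).GET().build();
    }

    // PUT /{name} creates a database; no request body is needed.
    static HttpRequest createDatabase(String name) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + name))
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    public static void main(String[] args) throws Exception {
        boolean send = args.length > 0 && args[0].equals("--send");
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest[] requests = {serverInfo(), listDatabases(), createDatabase("mydb")};
        for (HttpRequest request : requests) {
            System.out.println(request.method() + " " + request.uri());
            if (send) {
                // Only attempted when a CouchDB server is assumed to be running.
                HttpResponse<String> response =
                        client.send(request, HttpResponse.BodyHandlers.ofString());
                System.out.println("  -> " + response.statusCode() + " " + response.body());
            }
        }
    }
}
```
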
  • 25. Mapping Couch with Ektorp
      • You lose some of the joy of schema-less
      • But you do get lots of boilerplate ;-)
      • Oh, and strong typing.
  • 26. Writing a Couch MapReduce
      • You write a map function to extract data
      • You always return a key/value pair
          function(doc) {
            // Skip documents that have no title
            if (doc.title && doc.title.indexOf("Hi!") > -1) {
              emit(doc.title, doc);
            }
          }
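
To show what the map function on slide 26 produces, here is a rough Java analogue that runs the same test over a list of in-memory documents and collects the emitted key/value pairs into a view sorted by key. A `TreeMap` stands in for CouchDB's B-Tree index. Everything here is a hypothetical illustration: the class, the `doc` helper, and the document shape are invented for the example and are not part of any CouchDB client library.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.BiConsumer;

// Hypothetical in-memory stand-in for a CouchDB view.
class TitleView {

    // Counterpart of the slide's JavaScript map function: emit (title, doc)
    // for every document whose title contains "Hi!".
    static void map(Map<String, Object> doc, BiConsumer<String, Map<String, Object>> emit) {
        Object title = doc.get("title");
        if (title instanceof String && ((String) title).contains("Hi!")) {
            emit.accept((String) title, doc);
        }
    }

    // Run the map function over every document and keep the emitted rows
    // sorted by key. Several documents may emit the same key.
    static SortedMap<String, List<Map<String, Object>>> build(List<Map<String, Object>> docs) {
        SortedMap<String, List<Map<String, Object>>> view = new TreeMap<>();
        for (Map<String, Object> doc : docs) {
            map(doc, (key, value) -> view.computeIfAbsent(key, k -> new ArrayList<>()).add(value));
        }
        return view;
    }

    // Convenience helper for building a schemaless document; title may be null.
    static Map<String, Object> doc(String id, String title) {
        Map<String, Object> d = new HashMap<>();
        d.put("_id", id);
        if (title != null) {
            d.put("title", title);
        }
        return d;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> docs = new ArrayList<>();
        docs.add(doc("1", "Oh, Hi! again"));
        docs.add(doc("2", "Goodbye"));
        docs.add(doc("3", null));
        System.out.println(build(docs).keySet());
    }
}
```
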
  • 27. Neo4j
      • Stores data in a graph of nodes and relationships
      • Can handle billions of nodes per machine
      • Means you can query on relationships!
      • Supports ACID transactions
      • One 500 KB jar (!)
      • Dual-licensed GPL/Commercial
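
"Query on relationships" from slide 27 means a question like "who are my friends' friends?" becomes a walk along edges rather than a chain of joins. The sketch below is a tiny in-memory graph written with JDK collections only. It is not the Neo4j API, and the class, the method names, and the `KNOWS` relationship type are made up for the illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory graph of nodes and typed, directed relationships.
class TinyGraph {
    // node -> relationship type -> target nodes
    private final Map<String, Map<String, Set<String>>> outgoing = new HashMap<>();

    // Add a directed, typed relationship between two nodes.
    void relate(String from, String type, String to) {
        outgoing.computeIfAbsent(from, k -> new HashMap<>())
                .computeIfAbsent(type, k -> new TreeSet<>())
                .add(to);
    }

    // Nodes reachable from the given node in one hop of the given type.
    Set<String> neighbours(String node, String type) {
        return outgoing
                .getOrDefault(node, Collections.<String, Set<String>>emptyMap())
                .getOrDefault(type, Collections.<String>emptySet());
    }

    // Two hops along the same relationship type, leaving out the start
    // node and anyone who is already a direct neighbour.
    Set<String> friendsOfFriends(String start, String type) {
        Set<String> direct = neighbours(start, type);
        Set<String> result = new TreeSet<>();
        for (String friend : direct) {
            for (String candidate : neighbours(friend, type)) {
                if (!candidate.equals(start) && !direct.contains(candidate)) {
                    result.add(candidate);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TinyGraph graph = new TinyGraph();
        graph.relate("glen", "KNOWS", "anna");
        graph.relate("anna", "KNOWS", "glen");
        graph.relate("anna", "KNOWS", "bob");
        System.out.println(graph.friendsOfFriends("glen", "KNOWS"));
    }
}
```
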
  • 28. Sample Code
  • 29. Blogvertising
      • http://blogs.bytecode.com.au/glen
      • http://twitter.com/glen_a_smith
      • http://grailspodcast.com/
      • Download all the source from today:
      • http://bitbucket.org/glen_a_smith/cjug-nosql-examples
  • 30. Q&A Looking for a good book?