A Walk down NOSQL Lane
      in the Cloud

    New York City Cloud Computing Group
                          February 2011
                       Alexander Sicular
                               @siculars
Who is this blowhard?
Columbia University pays my mortgage

For the better part of a decade in Medical
Informatics

Am not shilling for any of these companies

Am not a computer scientist

Am a computer science enthusiast
particularly in the area of Informatics
When I put my data in
the “cloud”, to me it
 just means that it’s
    virtualized in
   someone else’s
     server room
...the Silver Lining
Many, many providers and only growing

  Amazon, Rackspace, Joyent, CouchOne,
  Cloudant, Azure, GAE, Heroku, no.de

Outsourced management

Zero capex

Controlled costs
...With a Chance of
         Rain?
Vendor lock in

Unreliable performance

  i/o

  cpu, memory

Bare metal > software virtualization
NoSQL or NOSQL?
Not Only SQL

Non/post relational

Big tent policy

Umbrella term

Fragmented



                      http://www.flickr.com/photos/morgennebel/2933723145/
Your Usage Patterns
Read vs. Write

Mutable vs. Immutable

Product Considerations:

  In place updates

  Write Only Logs
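
The difference between in-place updates and write-only logs can be sketched in a few lines of Python. This is a toy illustration only, not how any product in this deck stores data; the class names `InPlaceStore` and `AppendOnlyStore` are hypothetical.

```python
class InPlaceStore:
    """Mutable store: an update overwrites the previous value."""

    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value  # the old value is gone after this

    def get(self, key):
        return self.data.get(key)


class AppendOnlyStore:
    """Write-only log: every put appends a record, nothing is overwritten."""

    def __init__(self):
        self.log = []    # (key, value) records, oldest first
        self.index = {}  # key -> position of the latest record in the log

    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def get(self, key):
        pos = self.index.get(key)
        return None if pos is None else self.log[pos][1]

    def history(self, key):
        # Old versions remain readable until something compacts the log.
        return [v for k, v in self.log if k == key]
```

The append-only version trades disk space (old records pile up) for sequential writes and a recoverable history, which is why usage patterns matter when choosing a product.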
This vs. That
Riak wiki comparisons page
http://wiki.basho.com/Riak-Comparisons.html


Popular one page comparison of a number of
NOSQL players by Kristof Kovacs:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
NOSQL concepts are
  Not Brand New
Memcached since 2003                       http://memcached.org



Google papers 2003-2006

Amazon Dynamo 2007

Consistent Hashing (paper 1997; libketama for memcached clients 2007) http://www.last.fm/user/RJ/journal/2007/04/10/rz_libketama_-_a_consistent_hashing_algo_for_memcache_clients


Using relational systems as a key-value blob store
    2009 FriendFeed (not the first)         http://bret.appspot.com/entry/how-friendfeed-uses-mysql
Why NOSQL
Support for “Very Large” data sets

Schemaless

Denormalized

Green field

New applications



                      http://www.flickr.com/photos/gailtang/1243984297/
Academia
Google:

  Bigtable        http://labs.google.com/papers/bigtable.html



  GFS     http://labs.google.com/papers/gfs.html



  M/R     http://labs.google.com/papers/mapreduce.html



Amazon:

  Dynamo         http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf




NOSQL Summer                   http://nosqlsummer.org/papers
Under the Hood
      Terminology
Write Only Log           http://en.wikipedia.org/wiki/Log-structured_file_system



Merkle Trees        http://en.wikipedia.org/wiki/Hash_tree



B-trees   http://en.wikipedia.org/wiki/B-tree



Vector clock       http://en.wikipedia.org/wiki/Vector_clock



Bloom filters       http://en.wikipedia.org/wiki/Bloom_filters



Big O Notation         http://en.wikipedia.org/wiki/Big_o_notation



Consistent Hashing              http://en.wikipedia.org/wiki/Consistent_hashing
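
Of the terms above, consistent hashing is the easiest to show in code. The following is a minimal sketch, assuming MD5 as the hash and 100 virtual points per node; the `HashRing` class and its parameters are hypothetical and not taken from any specific product.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # MD5 is used only to spread keys evenly, not for security.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual points per node."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual points per node (an assumption)
        self._keys = []           # sorted positions on the ring
        self._ring = {}           # position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            h = _hash(f"{node}#{i}")
            self._ring[h] = node
            bisect.insort(self._keys, h)

    def remove(self, node):
        for i in range(self.replicas):
            h = _hash(f"{node}#{i}")
            del self._ring[h]
            self._keys.remove(h)

    def get(self, key):
        if not self._keys:
            raise KeyError("ring is empty")
        # First point clockwise from the key's position, wrapping around.
        idx = bisect.bisect(self._keys, _hash(key)) % len(self._keys)
        return self._ring[self._keys[idx]]
```

The property that matters: when a node is removed, only the keys that lived on that node move. Keys on the surviving nodes stay where they were.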
CAP Theorem
           http://en.wikipedia.org/wiki/CAP_theorem




Consistency

Availability

Partition Tolerance

   Pick two?

                                             http://guide.couchdb.org/draft/consistency.html
CouchDB
CouchOne, Cloudant

Erlang

Extreme replication scenarios

Works on phones

Updated indexing (b-tree)

HTTP interface

Offline usage

Sharded scaling
CouchDB Internal
  Architecture




  http://nosqlpedia.com/wiki/File:CouchDB-Arch.JPG
MongoDB
10Gen, MongoHQ, MongoLab

C++

huMONGOus

Sharded scaling, replicated master/slave

Located in NYC (go visit them)

Soft landing for those coming from mysql (relational databases)

Native javascript

Secondary indexes
MongoDB Sharding
     Diagram




http://www.snailinaturtleneck.com/blog/2010/03/30/sharding-with-the-fishes/
MySQL to Mongo Query similarity




       http://nosqlpedia.com/wiki/File:MongoDB.JPG
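
As a rough, self-contained illustration of that similarity, the sketch below runs a SQL `WHERE` clause with Python's built-in `sqlite3` (standing in for MySQL) and applies the same condition written as a MongoDB-style filter document. The `matches` helper is a hypothetical toy that handles only equality and `$gt`; it is not MongoDB's query engine or a driver API, and the table and field names are made up.

```python
import sqlite3

rows = [("alice", 31, "NYC"), ("bob", 25, "NYC"), ("carol", 42, "SF")]

# Relational side: rows in a table, filtered with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER, city TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
sql_names = [r[0] for r in conn.execute(
    "SELECT name FROM users WHERE city = 'NYC' AND age > 30 ORDER BY name")]

# Document side: the same condition as a filter document. In the mongo
# shell this would read: db.users.find({city: "NYC", age: {$gt: 30}})
docs = [{"name": n, "age": a, "city": c} for n, a, c in rows]
query = {"city": "NYC", "age": {"$gt": 30}}


def matches(doc, query):
    """Toy matcher: supports equality and $gt only."""
    for field, cond in query.items():
        if isinstance(cond, dict):
            if "$gt" in cond and not (field in doc and doc[field] > cond["$gt"]):
                return False
        elif doc.get(field) != cond:
            return False
    return True


doc_names = sorted(d["name"] for d in docs if matches(d, query))
```

Both `sql_names` and `doc_names` select the same user, which is the "soft landing" point: the filter document maps almost one-to-one onto the `WHERE` clause.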
Riak
Basho, Joyent

Erlang

Distributed

HTTP, protobuf

Native javascript, erlang

Multiple backends

Homogeneous

CAP tunable
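
"CAP tunable" in Dynamo-style stores is usually expressed with three numbers from the Dynamo paper: N replicas per key, R replicas that must answer a read, and W replicas that must acknowledge a write. A minimal sketch of the rule of thumb, assuming this simplified model (the function names are hypothetical, and failure handling and conflict resolution are ignored):

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """True when any read quorum must share a replica with any write quorum."""
    return r + w > n


def overlap_by_enumeration(n, r, w):
    """Brute-force check of the same property over every possible quorum pair."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))
```

With N=3, choosing R=2 and W=2 makes every read touch at least one replica that saw the latest write. Choosing R=1 and W=1 gives faster, more available operations at the cost of possibly stale reads.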
Hadoop
Cloudera, Apache Foundation

Java

High latency

Batch oriented

HDFS is GFS based

Open source Google stack via the Google papers

Huge ecosystem

Yahoo, FB, Twitter, Fortune 500

Pig, Hive, Flume
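
The batch-oriented model behind Hadoop is MapReduce. The sketch below imitates its three phases (map, shuffle, reduce) for a word count in plain Python on one machine. It does not use Hadoop's API; all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    # Emit one (word, 1) pair per word, as a mapper would.
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    # Group all values by key, as the framework does between the phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on many machines over data stored in HDFS, which is where the high latency and the batch orientation come from.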
HBase
Java

Low latency store

sits on top of Hadoop

Modeled after Google Bigtable

Column oriented

Thrift, protobuf

Backend for new Facebook Messaging service
Cassandra
Apache

Java

Column oriented

Like Bigtable and Dynamo

Originated at Facebook

At Twitter, Distributed counting
http://www.infoq.com/presentations/NoSQL-at-Twitter-by-Ryan-King
http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
Redis
OpenRedis

C

REmote DIctionary Server

Specific data structures

incredibly fast

memcached on steroids

replicated master/slave
Commonalities
Open Source

Adherence to common or standard:

  data formats

    json, bson, utf8, binary

  data transport mechanisms

    http, thrift, protobuf,
    simple wire protocols
Ok. So Now What?
Analyze your requirements

Mailing lists

IRC, twitter

Project pages, wiki

Github/Google Code/Bitbucket:

  project page

  specific language clients
Variety Pack
Hybrid architectures will become the norm

Twitter - mysql, cassandra, hadoop

Google - mysql, GAE (BT)

Facebook - mysql, cassandra, hbase, memcached

Yahoo - mysql, hadoop

LinkedIn - voldemort

                      http://www.flickr.com/photos/uncleweed/82245324/
Questions?




New York City Cloud Computing Group
                      February 2011
                   Alexander Sicular
                           @siculars
