SlideShare a Scribd company logo
NoSQL Databases

  technology watch #2




       Mar 2010
Contents

• Introduction
• Key-Value stores
• Column stores
• Document stores
• Graph Databases
• Benchmarks
• RDF & NoSQL




                     NoSQL Databases   Mar 2010   #2
Contents



       Part I

Introduction




   NoSQL Databases   Mar 2010   #3
Why NoSQL?

• Is NoSQL equal to “SQL “or to “Not Only SQL”?
• ACID doesn’t scale well
• Web apps have different needs (than the apps that RDBMS
  were designed for)
   –   Low and predictable response time (latency)
   –   Scalability & elasticity (at low cost!)
   –   High availability
   –   Flexible schemas / semi-structured data
   –   Geographic distribution (multiple datacenters)
• Web apps can (usually) do without
   – Transactions / strong consistency / integrity
   – Complex queries

                                NoSQL Databases         Mar 2010   #4
Some NoSQL use cases

1. Massive data volumes
  –   Massively distributed architecture required to store the data
  –   Google, Amazon, Yahoo, Facebook – 10-100K servers
2. Extreme query workload
  –   Impossible to efficiently do joins at that scale with an RDBMS
3. Schema evolution
  –   Schema flexibility (migration) is not trivial at large scale
  –   Schema changes can be gradually introduced with NoSQL




                               NoSQL Databases                   Mar 2010   #5
NoSQL pros/cons

•   Advantages
    –   Massive scalability
    –   High availability
    –   Lower cost (than competitive solutions at that scale)
    –   (usually) predictable elasticity
    –   Schema flexibility, sparse & semi-structured data
•   Disadvantages
    –   Limited query capabilities (so far)
    –   Eventual consistency is not intuitive to program for
        •   Makes client applications more complicated
    –   No standardizatrion
        •   Portability might be an issue
    –   Insufficient access control

                                   NoSQL Databases              Mar 2010   #6
CAP theorem

• Requirements to distributed systems
   – Consistency – the system is in a consistent state after an operation
       • All clients see the same data
       • Strong consistency (ACID) vs. eventual consistency (BASE)
   – Availability – the system is “always on”, no downtime
       • Node failure tolerance – all clients can find some available replica
       • software/hardware upgrade tolerance
   – Partition tolerance – the system continues to function even when
     split into disconnected subsets (by a network disruption)
       • Not only for reads, but writes as well!

• CAP Theorem (E. Brewer, N. Lynch)
   – You can satisfy at most 2 out of the 3 requirements



                                  NoSQL Databases                      Mar 2010   #7
CAP (2)

• CA
  – Single site clusters (easier to ensure all nodes are always in contact)
  – e.g. 2PC
  – When a partition occurs, the system blocks
• CP
  – Some data may be inaccessible (availability sacrificed), but the rest is
    still consistent/accurate
  – e.g. sharded database
• AP
  – System is still available under partitioning, but some of the data
    returned my be inaccurate
  – e.g. DNS, caches, Master/Slave replication
  – Need some conflict resolution strategy

                               NoSQL Databases                  Mar 2010       #8
… and also CAE

• CAE trade-off (Amazon?)
   – Cost-efficiency
   – High Availability
   – Elasticity
• Pick any two (Cost, Availability, Elasticity)
   – Cost Efficiency
       • By over-provisioning you get HA and E
   – High Availability
       • Just make the client wait when the system is overloaded (CE and E)
   – Elasticity
       • If you can predict your workload, you can provide HA + CE by booking
         the resources in advance

• Everyone wants HA, so the challenge is providing E in a CE
  way
                                NoSQL Databases                     Mar 2010    #9
NoSQL taxonomy

• Key-Value stores
   – Simple K/V lookups (DHT)
• Column stores
   – Each key is associated with many attributes (columns)
   – NoSQL column stores are actually hybrid row/column stores
       • Different from “pure” relational column stores!

• Document stores
   – Store semi-structured documents (JSON)
   – Map/Reduce based materialisation, sorting, aggregation, etc.
• Graph databases
   – Not exactly NoSQL…
       • can’t satisfy the requirements for High Availability and
         Scalability/Elasticity very well

                                  NoSQL Databases                   Mar 2010   #10
Contents



          Part II

Key-Value stores
PNUTS, Dynamo, Voldemort




       NoSQL Databases     Mar 2010   #11
PNUTS

• Yahoo, ~2008, part of the Y! Sherpa platform
• Requirements
   –   Scale out / elasticity
   –   Geo-replication
   –   High availability / Fault tolerance
   –   Relaxed consistency
• Simplified relational data model
   –   Flexible schema
   –   No Referential Integrity
   –   No joins, aggregation, etc
   –   Updates/deletes only by Primary Key
   –   “Multiget” (retrieve many records by PKs)


                                  NoSQL Databases   Mar 2010   #12
PNUTS – consistency model

• Timeline consistency
   – In-between serializability and eventual consistency
       • All replicas apply updates for a record in the same order
       • but not necessarily at the same time (eventually)
   – One replica is always chosen as a Master (per record!)
       • adaptively changed for load re-balancing (usage pattern tracking!)
       • updates go only to the Master, then async propagation to other replicas
         / reads go to replicas or Master
   – Read consistency levels
       • Read any – may return a stale version of a record (any replica, fast)
       • Read critical (V) – return a newer (or the same) record version than V
       • Read latest – most recent version (goes to Master, slow)
   – Write consistency levels
       • Write
       • Test-and-set (V) – write only if the record version is the same as V
         (ensures serialization)
                                  NoSQL Databases                      Mar 2010    #13
PNUTS – consistency model (2)




                                      (C) Yahoo!




          NoSQL Databases       Mar 2010           #14
PNUTS - architecture

• Regions
   – (usually) geographically distributed replicas (20+ datacenters)
   – Pub/Sub asynchronous replication (Tribble)
• Tablets
   – Horizontal partitioning of the tables (keyspace)
   – One copy of a Tablet within each Region
• Components
   – Routers
       • Cached copy of the interval mapping for tablets
   – Storage Units (SU)
       • Filesystem or MySQL
   – Tablet Controllers (TC)
       • Maintains interval mapping, tablet split, load re-balancing, etc

                                  NoSQL Databases                     Mar 2010   #15
PNUTS – architecture (2)




                              (C) Yahoo!




       NoSQL Databases     Mar 2010        #16
PNUTS – architecture (3)

• Message Broker
   –   Pub/Sub replication
   –   Notification service (even for external apps)
   –   Reliable logging
   –   1 broker per tablet, but subscriptions on the table level possible too
• “Mastership”
   – Record master – updates, “read latest”
   – Tablet master – inserts & deletes
• Query processing
   – Replica – “read any”, “read critical”
   – Master – updates, “read latest”
   – Scatter/Gather engine – range queries, table scans, multi-record
     updates

                                 NoSQL Databases                  Mar 2010      #17
Dynamo

• Amazon, ~2007
• P2P key-value store
   –   Object versioning
   –   Consistent hashing
   –   Gossip – membership & failure detection
   –   Quorum reads
• Requirements & assumptions
   –   Simple query model (unique keys, blobs, no schema, no multi-access)
   –   Scale out (elasticity)
   –   Eventual consistency (improved availability)
   –   Decentralized & symmetric (P2P), heterogeneous (load distribution)
   –   Low latency / high throughput
   –   Internal service (no security model)
                               NoSQL Databases                 Mar 2010      #18
Dynamo - partitioning

• Consistent hashing
   – “Ring” of nodes (random position assigned to nodes when joining)
   – Node leaving/joining the ring affects only its 2 neighbours
   – Virtual nodes (improves the load balancing) – node is assigned
     several positions in the ring
• Replication
   – N replicas (successor nodes on ring)
• Versioning
   –   All updates provide version context
   –   Eventual consistency
   –   Conflict reconciliation by clients (vector clocks)
   –   Vector clock size thresholds (truncation)


                                  NoSQL Databases            Mar 2010     #19
                                                                        (C) Amazon
Dynamo – operations

• Read/Write requests
   – Client sends a request to a random node, the node forwards it to the
     proper node (1st replica responsible for that partition – coordinator)
   – Coordinator will forward R/W requests to top N replicas
• Consistency
   – Quorum protocol
       • R, W – minimum number of nodes that must participate in a Read/ Write
         operation (R+W > N)
   – Writes
       • Coordinator generates the version and the vector clock
       • Coordinator sends requests to N replicas, if W replicas confirm then OK
   – Reads
       • Coordinator sends requests to N replicas, if R replicas respond then OK
       • If different versions are returned then reconcile and write back the
         reconciled version (read repair)
                                 NoSQL Databases                     Mar 2010      #20
Dynamo – failure handling

• “Hinted handoff”
   – Dynamo is “always writable”
   – If a replica node NR is down, the coordinator sends the write data to a
     new node NHH (with a hint that NR was the intended recipient)
   – Hinted writes are not readable, i.e. read quorum may still fail!
   – NHH will keep the data until NR is available again, then send the data
     version to NR and delete the local copy
• “Anti-entropy” replica synchronisation
   – What if a hinted replica NHH fails before it can sync with NR?
   – Merkle trees
       • Pros – minimum data transfer (only the root node hash compared when
         in-sync), very fast location of out-of-sync keys (tree traversal)
       • Cons – trees need to be re-calculated when nodes join/leave the ring
         (keyspace partitioning changes)

                                NoSQL Databases                  Mar 2010       #21
Dynamo – architecture

• Bootstrapping & node removal
   – New node gets a position in the ring
   – Several re-balancing strategies
       • Partition split - the node responsible for that keyspace partition splits it
         and sends half the data to the new node
       • Fixed size partitions – one or more partitions are sent to the new node
   – Similar process when a node leaves the ring
• Node components
   – Request handling & coordination
   – Membership & failure detection
   – Local persistence engine
       • MySQL
       • BerkleyDB
       • In-memory buffer + persistence

                                   NoSQL Databases                       Mar 2010       #22
Voldemort

• http://project-voldemort.com/
• Distributed key-value store
   – Based on Dynamo
• Originally developed by LinkedIn, now open source
• Features
   – Simple data model (no joins or complex queries, no RI, …)
   – P2P
   – Scale-out / elastic
       • Consistent hashing of keyspace
       • Fixed partitions (no splits, but owner may change when re-balancing)
   – Eventual consistency / High Availability
   – Replication
   – Failure handling
                                NoSQL Databases                     Mar 2010    #23
Voldemort & Dynamo

• Similar to Dynamo
   –   Consistent hashing
   –   Versions & Vector clocks -> eventual consistency
   –   R/W quorum (R+W>N)
   –   Hinted handoff (failure handling)
   –   Pluggable persistence engines
        • MySQL, BerkeleyDB, in-memory




                                NoSQL Databases           Mar 2010   #24
Contents



           Part III

 Column stores
BigTable, HBase, Cassandra




        NoSQL Databases      Mar 2010   #25
BigTable

• Google, ~2006
• Sparse, distributed, persistent multidimensional sorted map
• (row, column, timestamp) dimensions, value is string
• Key features
   –   Hybrid row/column store
   –   Single master (stand-by replica)
   –   Versioning
   –   Compression




                                                                (C) Google


                                NoSQL Databases     Mar 2010       #26
BigTable - data model

• Row
   – Keys are arbitrary strings
   – Data is sorted by row key
• Tablet
   – Row range is dynamically partitioned into tablets (sequence of rows)
   – Range scans are very efficient
   – Row keys should be chosen to improve locality of data access
• Column, Column Family
   –   Column keys are arbitrary strings, unlimited number of columns
   –   Column keys can be grouped into families
   –   Data in a CF is stored and compressed together (Locality Groups)
   –   Access control on the CF level


                                  NoSQL Databases               Mar 2010    #27
BigTable - data model (2)

• Timestamps
   – Each cell has multiple versions
   – Can be manually assigned
• Versioning
   – Automated garbage collection
        • Retain last N versions
        • Retain versions newer than TS

• Architecture
   –   Data stored on GFS
   –   Relies on Chubby (distributed lock service, Paxos)
   –   1 Master server
   –   thousands of Tablet servers


                                 NoSQL Databases            Mar 2010   #28
BigTable - architecture

• Master server
   –   Assign tablets to Tablet Servers
   –   Balance TS load
   –   Garbage collection
   –   Schema management
   –   Client data does not move through the MS (directly through TS)
   –   Tablet location not handled by MS
• Tablet server (many)
   – thousands of tablets per TS
   – Manages Read / Write / Split of its tablets




                               NoSQL Databases                 Mar 2010   #29
BigData - workflow

• Tablet location
   –   B+ tree index
   –   root (Chubby file) - location of the Root Metadata tablet
   –   1st level (Root Metadata tablet) – locations of all Metadata tablets
   –   2nd level (Metadata tablet) – location of a set of tablets
        • Each row is SSTable ID + end row ID
   – Clients cache tablet locations




                                                                              (C) Google


                                 NoSQL Databases                  Mar 2010           #30
BigData – workflow (2)

• Tablet access
   – SSTable – persistent, ordered immutable map from keys to values
       • Contains a sequence of blocks + block index
       • Compression at block level
       • 2 phase compression, 10:1 ratio (standard gzip up to 3:1)
   – Updates
       • 1st added to the redo log
       • Then to the in-memory buffer (memtable)
       • If memtable is full, then it is flushed into a new SSTable and emptied
   – Reads
       • Executed over the merged memtable and all SSTables (Bloom filters)
       • Merged view is very efficient (data is already sorted)
       • Bloom filters – probabilistic maps on whether a specific SSTable contains
         a specific cell (row, column), i.e. disk lookups are drastically reduced


                                 NoSQL Databases                      Mar 2010       #31
BigData – workflow (3)




                         (C) Google




      NoSQL Databases         Mar 2010   #32
BigData – workflow (4)

• SSTable compactions
   – Minor compaction
      • Memtable flushed on disk into a new SSTable and emptied
      • When: memtable is full
   – Merging compaction
      •   Merge the memtable and several SSTables into 1+ new SSTable(s)
      •   Then, empty the memtable and delete the input SSTables
      •   New SSTable(s) can still contain deletion info + deleted data (no purging)
      •   When: number of SSTables limit reached
   – Major compaction
      • A merging compaction with exactly 1 output SSTable and all deleted data
        purged
      • When: regularly (background)




                                  NoSQL Databases                      Mar 2010        #33
BigTable - performance

• Benchmark with 1, 50, 250 and 500 Tablet Servers




                                                (C) Google




                         NoSQL Databases              Mar 2010   #34
BigTable, real world data volumes (ca. 2006)




                                                   (C) Google




                 NoSQL Databases        Mar 2010          #35
HBase

• http://hadoop.apache.org/hbase
• Developed by Powerset, now Apache
• Based on BigTable
   – HDFS (GFS), ZooKeeper (Chubby)
   – Master Node (Master Server), Region Servers (Tablet Servers)
   – HStore (tablet), memcache (memtable), MapFile (SSTable)
• Features
   –   Data is stored sorted (no real indexes)
   –   Automatic partitioning
   –   Automatic re-balancing / re-partitioning
   –   Fault tolerance (HDFS, 3 replicas)



                                NoSQL Databases              Mar 2010   #36
HBase - architecture




                       (C) Tom White




     NoSQL Databases          Mar 2010   #37
HBase - operations

• Workflow
   – Hstore/tablet location (== BigTable)
         • clients talk to ZooKeeper
         • root metadata, metadata tables, client side caching, etc
   – Writes (== BigTable)
         • First added to the commit log, then to the memcache (memtable),
           eventually flushed to MapFile (SSTable)
   – Reads (== BigTable)
         • First read from memcache, if info is incomplete then read the MapFiles
   – Compactions (== BigTable)
• APIs
   – Thrift
   – REST
         • GET | PUT | POST | DELETE <table>/<row>/<column>:<qualifier>/<timestamp>

                                   NoSQL Databases                      Mar 2010      #38
HBase – MapReduce jobs

• Can be a source/sink for M/R jobs in Hadoop




                                                (C) ???




                         NoSQL Databases        Mar 2010   #39
HBase vs. BigTable

• http://www.larsgeorge.com/2009/11/hbase-vs-bigtable-
  comparison.html

       Feature                     BigTable                       HBase
    Access control                      +                            -
    Master server                    single                   single/multiple*
    Locality groups                     +                            -
      Cell cache                        +                            -
  Commit logs (WAL)           primary + secondary                  single
Skip WAL for bulk loads                 ?                           yes
    Data isolation                      +                            -
      replication                       +                       In progress
                      Everything else pretty much the same!


                                 NoSQL Databases                    Mar 2010     #40
Cassandra

• http://cassandra.apache.org/
• Developed by Facebook (inbox), now Apache
   – Facebook now developing its own version again
• Based on Google BigTable (data model) and Amazon Dynamo
  (partitioning & consistency)
• P2P
   – Every node is aware of all other nodes in the cluster
• Design goals
   –   High availability
   –   Eventual consistency (improves HA)
   –   Incremental scalability / elasticity
   –   Optimistic replication

                                NoSQL Databases              Mar 2010   #41
Cassandra – data model, partitioning

• Data model
   – Same as BigTable
   – Super Columns (nested Columns) and Super Column Families
   – column order in a CF can be specified (name, time)
• Cluster membership
   – Gossip – every nodes gossips to 1-3 other nodes about the state of
     the cluster (merge incoming info with its own)
   – Changes in the cluster (node in/out, failure) propagate quickly (LogN)
   – Probabilistic failure detection (sliding window, Exp(α) or Nor(μ,σ2))
• Dynamic partitioning
   – Consistent hashing
   – Ring of nodes
   – Nodes can be “moved” on the ring for load balancing

                               NoSQL Databases                 Mar 2010       #42
Cassandra - operations

• Writes (== BigTable… almost)
   – Client sends request to a random node, the node determines the
     actual node responsible for storing the data (cluster awareness)
   – Data replicated to N nodes (config)
   – Data first added to the commit log, then to the memtable, eventually
     flushed to SSTable
• Reads (== BigTable… almost)
   – Send request to a random node, which forwards it to all the N nodes
     having the data
       • Single read – return first response / Quorum read – value that the
         majority of nodes agrees on
   – First read from memtable, if info is incomplete then read the
     SSTables
   – Bloom filters for efficient SSTable lookups
• Compactions (== BigTable) Databases
                         NoSQL                                      Mar 2010   #43
Cassandra @ Facebook

• Inbox search
• ca. 2009 - 50 TB data, 150 nodes, 2 datacenters
• Performance (production)




                                                    (C) facebook




                          NoSQL Databases           Mar 2010       #44
Contents



       Part IV

Document stores
     CouchDB




    NoSQL Databases   Mar 2010   #45
CouchDB

• http://couchdb.apache.org/ , ~2005
• Schema-free, document oriented database
   –   Documents stored in JSON format (XML in old versions)
   –   B-tree storage engine
   –   MVCC model, no locking
   –   no joins, no PK/FK (UUIDs are auto assigned)
   –   Implemented in Erlang
        • 1st version in C++, 2nd in Erlang and 500 times more scalable (source:
          “Erlang Programming” by Cesarini & Thompson)
   – Replication (incremental)
                                   {
• Documents                            “name”: “Ontotext”,
                                       “url”: “www.ontotext.com”
   – UUID, version                     “employees”: 40
   – Old versions retained             “products”: [“OWLIM”, “KIM”, “LifeSKIM”, “JOCI”]
                                   }
                                   NoSQL Databases                      Mar 2010      #46
CouchDB (2)

• REST API
           CRUD               HTTP               params
           Create              PUT               /db/docid
            Read               GET               /db/docid
           Update             POST               /db/docid
           Delete            DELETE              /db/docid

• Views
   –   Filter, sort, “join”, aggregate, report
   –   Map/Reduce based, language independent (JavaScript by default)
   –   K/V pairs from Map/Reduce are also stored in the B-tree engine
   –   Built on demand
   –   Can be materialized & incrementally updated

                               NoSQL Databases               Mar 2010   #47
CouchDB – views




                     (C) Chris Anderson / Apache




   NoSQL Databases           Mar 2010              #48
Contents



     Part V

Benchmarks




  NoSQL Databases   Mar 2010   #49
NoSQL feature matrix (incomplete)




                              (C) R. Ramakrishnan / Yahoo

            NoSQL Databases           Mar 2010              #50
CAP positioning


                         Consistency




            •BigTable
            •HBase




                        •Dynamo
Partition
tolerance               •PNUTS             Availability
                        •Cassandra
                        •Voldemort
                        •CouchDB?




                         NoSQL Databases                  Mar 2010   #51
Yahoo Cloud Serving Benchmark (YCSB)

• Benchmark various NoSQL systems
   – Performance
       • Measure latency/throughput curve on a fixed hardware
   – Scale out (elasticity)
       • add hardware, increase data size / workload proportionally
       • Measure latency (best case: constant)
       • Or… add hardware but same data/workload (latency should drop)
   – Later – replication, availability, …
• Testbed
   – 6 boxes {2x4 core 2.5GHz, 8GB RAM, 6 HDD (SAS RAID1+0), GB eth}
   – 120M records (1K) ~ 20GB data per server
   – 100+ client threads



                                 NoSQL Databases                Mar 2010   #52
YCSB – performance

• Update heavy workload (50/50 reads/updates)
• Reminders
   – Cassandra (Dynamo) is “always writable”
   – HBase buffers updates in memory
   – PNUTS uses MySQL as a persistence engine




                            NoSQL Databases     (C) B. Cooper et al. / Yahoo   #53
                                                     Mar 2010
YCSB – performance (2)

• Read heavy workload
• Reminders
   – Cassandra (Dynamo) will read several replicas for read consistency
   – HBase will access several files during a read (even with Bloom filters)
   – PNUTS uses MySQL as a persistence engine




                               NoSQL Databases                (C) B. Cooper et al. / Yahoo   #54
                                                                   Mar 2010
YCSB – scalability

• Can more servers handle proportionally bigger workload?
   – Best case: latency should be constant
• Notes
   – PNUTS and Cassandra scale very well
   – HBase performance varies a lot




                                                 (C) B. Cooper et al. / Yahoo




                              NoSQL Databases         Mar 2010                  #55
YCSB – elastic speedup

• What is the impact of adding one server (same workload)?
   – Best case: latency should drop
• Notes
   – 3 servers, add 1 more at 5th min (different latency scales!)
   – Cassandra – performance improvement by 10% (quickly!)
   – Hbase – large latency spike initially, then 10% performance
     improvement (within 20 min)
                                                                    (C) B. Cooper et al. / Yahoo




                               NoSQL Databases                 Mar 2010                 #56
Contents



      Part VI

RDF & NoSQL?




   NoSQL Databases   Mar 2010   #57
RDF & NoSQL

• No easy match
   – Partitioning by key not so suitable for RDF (S, P, O)
   – Data locality insufficient for RDF materialisation
   – Full featured query language (SPARQL) cannot be efficiently mapped
     to NoSQL key/value lookups and range scans
   – Eventual consistency is a problem (materialisation with stale data)
• So far RDF limited to the graph databases subset of NoSQL
   – But graph databases don’t provide massive scalability




                              NoSQL Databases                Mar 2010      #58
Useful links

• http://groups.google.com/group/nosql-discussion
• “Bigtable: A Distributed Storage System for Structured Data”
• “Cassandra - A Decentralized Structured Storage System”
• "Dynamo: Amazon’s Highly Available Key-value Store"
• “PNUTS: Yahoo!'s Hosted Data Serving Platform”
• “Benchmarking Cloud Serving Systems with YCSB”
• “HBase vs. BigTable Comparison”




                          NoSQL Databases            Mar 2010    #59
Q&A




Questions?




  NoSQL Databases   Mar 2010   #60

More Related Content

What's hot

NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Harri Kauhanen
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
Dr-Dipali Meher
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architecture
Bishal Khanal
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
Saurav Haloi
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Ravi Teja
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
James Serra
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
PolarSeven Pty Ltd
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
Heman Hosainpana
 
NoSql
NoSqlNoSql
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
Mike Dirolf
 
The Basics of MongoDB
The Basics of MongoDBThe Basics of MongoDB
The Basics of MongoDB
valuebound
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
Ramakant Soni
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
Don Demcsak
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
Navdeep Charan
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
Fabio Fumarola
 
MongoDB presentation
MongoDB presentationMongoDB presentation
MongoDB presentation
Hyphen Call
 
MongoDB Fundamentals
MongoDB FundamentalsMongoDB Fundamentals
MongoDB Fundamentals
MongoDB
 
SQL vs. NoSQL Databases
SQL vs. NoSQL DatabasesSQL vs. NoSQL Databases
SQL vs. NoSQL Databases
Osama Jomaa
 
5 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/25 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/2
Fabio Fumarola
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
Murat Çakal
 

What's hot (20)

NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Mongodb basics and architecture
Mongodb basics and architectureMongodb basics and architecture
Mongodb basics and architecture
 
Introduction to Apache ZooKeeper
Introduction to Apache ZooKeeperIntroduction to Apache ZooKeeper
Introduction to Apache ZooKeeper
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Key-Value NoSQL Database
Key-Value NoSQL DatabaseKey-Value NoSQL Database
Key-Value NoSQL Database
 
NoSql
NoSqlNoSql
NoSql
 
Introduction to MongoDB
Introduction to MongoDBIntroduction to MongoDB
Introduction to MongoDB
 
The Basics of MongoDB
The Basics of MongoDBThe Basics of MongoDB
The Basics of MongoDB
 
NOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQLNOSQL- Presentation on NoSQL
NOSQL- Presentation on NoSQL
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.A Seminar on NoSQL Databases.
A Seminar on NoSQL Databases.
 
7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth7. Key-Value Databases: In Depth
7. Key-Value Databases: In Depth
 
MongoDB presentation
MongoDB presentationMongoDB presentation
MongoDB presentation
 
MongoDB Fundamentals
MongoDB FundamentalsMongoDB Fundamentals
MongoDB Fundamentals
 
SQL vs. NoSQL Databases
SQL vs. NoSQL DatabasesSQL vs. NoSQL Databases
SQL vs. NoSQL Databases
 
5 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/25 Data Modeling for NoSQL 1/2
5 Data Modeling for NoSQL 1/2
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 

Viewers also liked

NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
BADR
 
An Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDBAn Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDB
William LaForest
 
NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and cons
Fabio Fumarola
 
Les modèles NoSQL
Les modèles NoSQLLes modèles NoSQL
Les modèles NoSQLebiznext
 
NoSQL et Big Data
NoSQL et Big DataNoSQL et Big Data
NoSQL et Big Data
acogoluegnes
 
Introduction to Cassandra (June 2010)
Introduction to Cassandra (June 2010)Introduction to Cassandra (June 2010)
Introduction to Cassandra (June 2010)
gdusbabek
 
Introduction aux bases de données NoSQL
Introduction aux bases de données NoSQLIntroduction aux bases de données NoSQL
Introduction aux bases de données NoSQL
Antoine Augusti
 
Big Data: Hadoop Map / Reduce sur Windows et Windows Azure
Big Data: Hadoop Map / Reduce sur Windows et Windows AzureBig Data: Hadoop Map / Reduce sur Windows et Windows Azure
Big Data: Hadoop Map / Reduce sur Windows et Windows Azure
Microsoft
 
NoSQL: Introducción a las Bases de Datos no estructuradas
NoSQL: Introducción a las Bases de Datos no estructuradasNoSQL: Introducción a las Bases de Datos no estructuradas
NoSQL: Introducción a las Bases de Datos no estructuradas
Diego López-de-Ipiña González-de-Artaza
 
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and whenNoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
Lorenzo Alberton
 
Techday Arrow Group: Hadoop & le Big Data
Techday Arrow Group: Hadoop & le Big DataTechday Arrow Group: Hadoop & le Big Data
Techday Arrow Group: Hadoop & le Big Data
Arrow Group
 
Presentation Hadoop Québec
Presentation Hadoop QuébecPresentation Hadoop Québec
Presentation Hadoop Québec
Mathieu Dumoulin
 
Architectures techniques NoSQL
Architectures techniques NoSQLArchitectures techniques NoSQL
Architectures techniques NoSQLOCTO Technology
 
Une introduction à HBase
Une introduction à HBaseUne introduction à HBase
Une introduction à HBase
Modern Data Stack France
 
Cours HBase et Base de Données Orientées Colonnes (HBase, Column Oriented Dat...
Cours HBase et Base de Données Orientées Colonnes (HBase, Column Oriented Dat...Cours HBase et Base de Données Orientées Colonnes (HBase, Column Oriented Dat...
Cours HBase et Base de Données Orientées Colonnes (HBase, Column Oriented Dat...
Hatim CHAHDI
 
Les BD NoSQL
Les BD NoSQLLes BD NoSQL
Les BD NoSQL
Minyar Sassi Hidri
 
Hadoop et son écosystème
Hadoop et son écosystèmeHadoop et son écosystème
Hadoop et son écosystème
Khanh Maudoux
 

Viewers also liked (18)

NoSQL Databases
NoSQL DatabasesNoSQL Databases
NoSQL Databases
 
An Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDBAn Introduction to Big Data, NoSQL and MongoDB
An Introduction to Big Data, NoSQL and MongoDB
 
NoSQL databases pros and cons
NoSQL databases pros and consNoSQL databases pros and cons
NoSQL databases pros and cons
 
Les modèles NoSQL
Les modèles NoSQLLes modèles NoSQL
Les modèles NoSQL
 
NoSQL et Big Data
NoSQL et Big DataNoSQL et Big Data
NoSQL et Big Data
 
Introduction to Cassandra (June 2010)
Introduction to Cassandra (June 2010)Introduction to Cassandra (June 2010)
Introduction to Cassandra (June 2010)
 
Introduction aux bases de données NoSQL
Introduction aux bases de données NoSQLIntroduction aux bases de données NoSQL
Introduction aux bases de données NoSQL
 
Big Data: Hadoop Map / Reduce sur Windows et Windows Azure
Big Data: Hadoop Map / Reduce sur Windows et Windows AzureBig Data: Hadoop Map / Reduce sur Windows et Windows Azure
Big Data: Hadoop Map / Reduce sur Windows et Windows Azure
 
NoSQL: Introducción a las Bases de Datos no estructuradas
NoSQL: Introducción a las Bases de Datos no estructuradasNoSQL: Introducción a las Bases de Datos no estructuradas
NoSQL: Introducción a las Bases de Datos no estructuradas
 
NoSQL Databases: Why, what and when
NoSQL Databases: Why, what and whenNoSQL Databases: Why, what and when
NoSQL Databases: Why, what and when
 
Techday Arrow Group: Hadoop & le Big Data
Techday Arrow Group: Hadoop & le Big DataTechday Arrow Group: Hadoop & le Big Data
Techday Arrow Group: Hadoop & le Big Data
 
Presentation Hadoop Québec
Presentation Hadoop QuébecPresentation Hadoop Québec
Presentation Hadoop Québec
 
Architectures techniques NoSQL
Architectures techniques NoSQLArchitectures techniques NoSQL
Architectures techniques NoSQL
 
Une introduction à HBase
Une introduction à HBaseUne introduction à HBase
Une introduction à HBase
 
Cours HBase et Base de Données Orientées Colonnes (HBase, Column Oriented Dat...
Cours HBase et Base de Données Orientées Colonnes (HBase, Column Oriented Dat...Cours HBase et Base de Données Orientées Colonnes (HBase, Column Oriented Dat...
Cours HBase et Base de Données Orientées Colonnes (HBase, Column Oriented Dat...
 
Les BD NoSQL
Les BD NoSQLLes BD NoSQL
Les BD NoSQL
 
Introduction à Hadoop
Introduction à HadoopIntroduction à Hadoop
Introduction à Hadoop
 
Hadoop et son écosystème
Hadoop et son écosystèmeHadoop et son écosystème
Hadoop et son écosystème
 

Similar to NoSQL databases

No sql databases
No sql databasesNo sql databases
No sql databases
swathika rajan
 
NoSQL_Night
NoSQL_NightNoSQL_Night
NoSQL_Night
Clarence J M Tauro
 
NoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed DatabaseNoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed Database
Joe Alex
 
Yes sql08 inmemorydb
Yes sql08 inmemorydbYes sql08 inmemorydb
Yes sql08 inmemorydb
Daniel Austin
 
PostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLPostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQL
Alexei Krasner
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
LinkedIn
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL Cluster
Mat Keep
 
NoSQL.pptx
NoSQL.pptxNoSQL.pptx
NoSQL.pptx
RithikRaj25
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
Qian Lin
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
Iraklis Psaroudakis
 
No sql
No sqlNo sql
No sql
Murat Çakal
 
Cassandra from the trenches: migrating Netflix (update)
Cassandra from the trenches: migrating Netflix (update)Cassandra from the trenches: migrating Netflix (update)
Cassandra from the trenches: migrating Netflix (update)
Jason Brown
 
[db tech showcase Tokyo 2016] E32: My Life as a Disruptor by Jim Starkey
[db tech showcase Tokyo 2016] E32: My Life as a Disruptor by Jim Starkey[db tech showcase Tokyo 2016] E32: My Life as a Disruptor by Jim Starkey
[db tech showcase Tokyo 2016] E32: My Life as a Disruptor by Jim Starkey
Insight Technology, Inc.
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
Rahul Borate
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
Rahul Borate
 
Data Storage Solutions for SNS game
Data Storage Solutions for SNS gameData Storage Solutions for SNS game
Data Storage Solutions for SNS game
we20
 
Data storage solutions for sns game
Data storage solutions for sns gameData storage solutions for sns game
Data storage solutions for sns game
Son Aris
 
Cassandra for mission critical data
Cassandra for mission critical dataCassandra for mission critical data
Cassandra for mission critical data
Oleksandr Semenov
 
Cassandra from the trenches: migrating Netflix
Cassandra from the trenches: migrating NetflixCassandra from the trenches: migrating Netflix
Cassandra from the trenches: migrating Netflix
Jason Brown
 
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More FlexibilityNOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
Ivan Zoratti
 

Similar to NoSQL databases (20)

No sql databases
No sql databasesNo sql databases
No sql databases
 
NoSQL_Night
NoSQL_NightNoSQL_Night
NoSQL_Night
 
NoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed DatabaseNoSQL A brief look at Apache Cassandra Distributed Database
NoSQL A brief look at Apache Cassandra Distributed Database
 
Yes sql08 inmemorydb
Yes sql08 inmemorydbYes sql08 inmemorydb
Yes sql08 inmemorydb
 
PostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLPostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQL
 
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedInJay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn
 
PayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL ClusterPayPal Big Data and MySQL Cluster
PayPal Big Data and MySQL Cluster
 
NoSQL.pptx
NoSQL.pptxNoSQL.pptx
NoSQL.pptx
 
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...
 
Bigtable and Dynamo
Bigtable and DynamoBigtable and Dynamo
Bigtable and Dynamo
 
No sql
No sqlNo sql
No sql
 
Cassandra from the trenches: migrating Netflix (update)
Cassandra from the trenches: migrating Netflix (update)Cassandra from the trenches: migrating Netflix (update)
Cassandra from the trenches: migrating Netflix (update)
 
[db tech showcase Tokyo 2016] E32: My Life as a Disruptor by Jim Starkey
[db tech showcase Tokyo 2016] E32: My Life as a Disruptor by Jim Starkey[db tech showcase Tokyo 2016] E32: My Life as a Disruptor by Jim Starkey
[db tech showcase Tokyo 2016] E32: My Life as a Disruptor by Jim Starkey
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
 
Data Storage Solutions for SNS game
Data Storage Solutions for SNS gameData Storage Solutions for SNS game
Data Storage Solutions for SNS game
 
Data storage solutions for sns game
Data storage solutions for sns gameData storage solutions for sns game
Data storage solutions for sns game
 
Cassandra for mission critical data
Cassandra for mission critical dataCassandra for mission critical data
Cassandra for mission critical data
 
Cassandra from the trenches: migrating Netflix
Cassandra from the trenches: migrating NetflixCassandra from the trenches: migrating Netflix
Cassandra from the trenches: migrating Netflix
 
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More FlexibilityNOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
NOSQL Meets Relational - The MySQL Ecosystem Gains More Flexibility
 

More from Marin Dimitrov

Measuring the Productivity of Your Engineering Organisation - the Good, the B...
Measuring the Productivity of Your Engineering Organisation - the Good, the B...Measuring the Productivity of Your Engineering Organisation - the Good, the B...
Measuring the Productivity of Your Engineering Organisation - the Good, the B...
Marin Dimitrov
 
Mapping Your Career Journey
Mapping Your Career JourneyMapping Your Career Journey
Mapping Your Career Journey
Marin Dimitrov
 
Open Source @ Uber
Open Source @ Uber Open Source @ Uber
Open Source @ Uber
Marin Dimitrov
 
Trust - the Key Success Factor for Teams & Organisations
Trust - the Key Success Factor for Teams & OrganisationsTrust - the Key Success Factor for Teams & Organisations
Trust - the Key Success Factor for Teams & Organisations
Marin Dimitrov
 
Uber @ Telerik Academy 2018
Uber @ Telerik Academy 2018Uber @ Telerik Academy 2018
Uber @ Telerik Academy 2018
Marin Dimitrov
 
Machine Learning @ Uber
Machine Learning @ UberMachine Learning @ Uber
Machine Learning @ Uber
Marin Dimitrov
 
Career Advice for My Younger Self
Career Advice for My Younger SelfCareer Advice for My Younger Self
Career Advice for My Younger Self
Marin Dimitrov
 
Scaling Your Engineering Organization with Distributed Sites
Scaling Your Engineering Organization with Distributed SitesScaling Your Engineering Organization with Distributed Sites
Scaling Your Engineering Organization with Distributed Sites
Marin Dimitrov
 
Building, Scaling and Leading High-Performance Teams
Building, Scaling and Leading High-Performance TeamsBuilding, Scaling and Leading High-Performance Teams
Building, Scaling and Leading High-Performance Teams
Marin Dimitrov
 
Uber @ Career Days 2017 (Sofia University)
Uber @ Career Days 2017 (Sofia University)Uber @ Career Days 2017 (Sofia University)
Uber @ Career Days 2017 (Sofia University)
Marin Dimitrov
 
GraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL QueriesGraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL Queries
Marin Dimitrov
 
DataGraft Platform: RDF Database-as-a-Service
DataGraft Platform: RDF Database-as-a-ServiceDataGraft Platform: RDF Database-as-a-Service
DataGraft Platform: RDF Database-as-a-Service
Marin Dimitrov
 
On-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the CloudOn-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the Cloud
Marin Dimitrov
 
Low-cost Open Data As-a-Service
Low-cost Open Data As-a-ServiceLow-cost Open Data As-a-Service
Low-cost Open Data As-a-Service
Marin Dimitrov
 
Text Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-ServiceText Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-Service
Marin Dimitrov
 
RDF Database-as-a-Service with S4
RDF Database-as-a-Service with S4RDF Database-as-a-Service with S4
RDF Database-as-a-Service with S4
Marin Dimitrov
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
Marin Dimitrov
 
Enabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and ReuseEnabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and Reuse
Marin Dimitrov
 
S4: The Self-Service Semantic Suite
S4: The Self-Service Semantic SuiteS4: The Self-Service Semantic Suite
S4: The Self-Service Semantic Suite
Marin Dimitrov
 
Scaling to Millions of Concurrent SPARQL Queries on the Cloud
Scaling to Millions of Concurrent SPARQL Queries on the CloudScaling to Millions of Concurrent SPARQL Queries on the Cloud
Scaling to Millions of Concurrent SPARQL Queries on the Cloud
Marin Dimitrov
 

More from Marin Dimitrov (20)

Measuring the Productivity of Your Engineering Organisation - the Good, the B...
Measuring the Productivity of Your Engineering Organisation - the Good, the B...Measuring the Productivity of Your Engineering Organisation - the Good, the B...
Measuring the Productivity of Your Engineering Organisation - the Good, the B...
 
Mapping Your Career Journey
Mapping Your Career JourneyMapping Your Career Journey
Mapping Your Career Journey
 
Open Source @ Uber
Open Source @ Uber Open Source @ Uber
Open Source @ Uber
 
Trust - the Key Success Factor for Teams & Organisations
Trust - the Key Success Factor for Teams & OrganisationsTrust - the Key Success Factor for Teams & Organisations
Trust - the Key Success Factor for Teams & Organisations
 
Uber @ Telerik Academy 2018
Uber @ Telerik Academy 2018Uber @ Telerik Academy 2018
Uber @ Telerik Academy 2018
 
Machine Learning @ Uber
Machine Learning @ UberMachine Learning @ Uber
Machine Learning @ Uber
 
Career Advice for My Younger Self
Career Advice for My Younger SelfCareer Advice for My Younger Self
Career Advice for My Younger Self
 
Scaling Your Engineering Organization with Distributed Sites
Scaling Your Engineering Organization with Distributed SitesScaling Your Engineering Organization with Distributed Sites
Scaling Your Engineering Organization with Distributed Sites
 
Building, Scaling and Leading High-Performance Teams
Building, Scaling and Leading High-Performance TeamsBuilding, Scaling and Leading High-Performance Teams
Building, Scaling and Leading High-Performance Teams
 
Uber @ Career Days 2017 (Sofia University)
Uber @ Career Days 2017 (Sofia University)Uber @ Career Days 2017 (Sofia University)
Uber @ Career Days 2017 (Sofia University)
 
GraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL QueriesGraphDB Connectors – Powering Complex SPARQL Queries
GraphDB Connectors – Powering Complex SPARQL Queries
 
DataGraft Platform: RDF Database-as-a-Service
DataGraft Platform: RDF Database-as-a-ServiceDataGraft Platform: RDF Database-as-a-Service
DataGraft Platform: RDF Database-as-a-Service
 
On-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the CloudOn-Demand RDF Graph Databases in the Cloud
On-Demand RDF Graph Databases in the Cloud
 
Low-cost Open Data As-a-Service
Low-cost Open Data As-a-ServiceLow-cost Open Data As-a-Service
Low-cost Open Data As-a-Service
 
Text Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-ServiceText Analytics & Linked Data Management As-a-Service
Text Analytics & Linked Data Management As-a-Service
 
RDF Database-as-a-Service with S4
RDF Database-as-a-Service with S4RDF Database-as-a-Service with S4
RDF Database-as-a-Service with S4
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Enabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and ReuseEnabling Low-cost Open Data Publishing and Reuse
Enabling Low-cost Open Data Publishing and Reuse
 
S4: The Self-Service Semantic Suite
S4: The Self-Service Semantic SuiteS4: The Self-Service Semantic Suite
S4: The Self-Service Semantic Suite
 
Scaling to Millions of Concurrent SPARQL Queries on the Cloud
Scaling to Millions of Concurrent SPARQL Queries on the CloudScaling to Millions of Concurrent SPARQL Queries on the Cloud
Scaling to Millions of Concurrent SPARQL Queries on the Cloud
 

Recently uploaded

Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
Jason Packer
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Wask
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
Zilliz
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
Postman
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
innovationoecd
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
Daiki Mogmet Ito
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
kumardaparthi1024
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
David Brossard
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
panagenda
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
jpupo2018
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Alpen-Adria-Universität
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
Mariano Tinti
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
Zilliz
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 

Recently uploaded (20)

Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024Columbus Data & Analytics Wednesdays - June 2024
Columbus Data & Analytics Wednesdays - June 2024
 
Digital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying AheadDigital Marketing Trends in 2024 | Guide for Staying Ahead
Digital Marketing Trends in 2024 | Guide for Staying Ahead
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Programming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup SlidesProgramming Foundation Models with DSPy - Meetup Slides
Programming Foundation Models with DSPy - Meetup Slides
 
WeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation TechniquesWeTestAthens: Postman's AI & Automation Techniques
WeTestAthens: Postman's AI & Automation Techniques
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
GenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizationsGenAI Pilot Implementation in the organizations
GenAI Pilot Implementation in the organizations
 
OpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - AuthorizationOpenID AuthZEN Interop Read Out - Authorization
OpenID AuthZEN Interop Read Out - Authorization
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
HCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAUHCL Notes and Domino License Cost Reduction in the World of DLAU
HCL Notes and Domino License Cost Reduction in the World of DLAU
 
Project Management Semester Long Project - Acuity
Project Management Semester Long Project - AcuityProject Management Semester Long Project - Acuity
Project Management Semester Long Project - Acuity
 
Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing InstancesEnergy Efficient Video Encoding for Cloud and Edge Computing Instances
Energy Efficient Video Encoding for Cloud and Edge Computing Instances
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Mariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceXMariano G Tinti - Decoding SpaceX
Mariano G Tinti - Decoding SpaceX
 
Fueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte WebinarFueling AI with Great Data with Airbyte Webinar
Fueling AI with Great Data with Airbyte Webinar
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 

NoSQL databases

  • 1. NoSQL Databases technology watch #2 Mar 2010
  • 2. Contents • Introduction • Key-Value stores • Column stores • Document stores • Graph Databases • Benchmarks • RDF & NoSQL NoSQL Databases Mar 2010 #2
  • 3. Contents Part I Introduction NoSQL Databases Mar 2010 #3
  • 4. Why NoSQL? • Is NoSQL equal to “SQL “or to “Not Only SQL”? • ACID doesn’t scale well • Web apps have different needs (than the apps that RDBMS were designed for) – Low and predictable response time (latency) – Scalability & elasticity (at low cost!) – High availability – Flexible schemas / semi-structured data – Geographic distribution (multiple datacenters) • Web apps can (usually) do without – Transactions / strong consistency / integrity – Complex queries NoSQL Databases Mar 2010 #4
  • 5. Some NoSQL use cases 1. Massive data volumes – Massively distributed architecture required to store the data – Google, Amazon, Yahoo, Facebook – 10-100K servers 2. Extreme query workload – Impossible to efficiently do joins at that scale with an RDBMS 3. Schema evolution – Schema flexibility (migration) is not trivial at large scale – Schema changes can be gradually introduced with NoSQL NoSQL Databases Mar 2010 #5
  • 6. NoSQL pros/cons • Advantages – Massive scalability – High availability – Lower cost (than competitive solutions at that scale) – (usually) predictable elasticity – Schema flexibility, sparse & semi-structured data • Disadvantages – Limited query capabilities (so far) – Eventual consistency is not intuitive to program for • Makes client applications more complicated – No standardizatrion • Portability might be an issue – Insufficient access control NoSQL Databases Mar 2010 #6
  • 7. CAP theorem • Requirements to distributed systems – Consistency – the system is in a consistent state after an operation • All clients see the same data • Strong consistency (ACID) vs. eventual consistency (BASE) – Availability – the system is “always on”, no downtime • Node failure tolerance – all clients can find some available replica • software/hardware upgrade tolerance – Partition tolerance – the system continues to function even when split into disconnected subsets (by a network disruption) • Not only for reads, but writes as well! • CAP Theorem (E. Brewer, N. Lynch) – You can satisfy at most 2 out of the 3 requirements NoSQL Databases Mar 2010 #7
  • 8. CAP (2) • CA – Single site clusters (easier to ensure all nodes are always in contact) – e.g. 2PC – When a partition occurs, the system blocks • CP – Some data may be inaccessible (availability sacrificed), but the rest is still consistent/accurate – e.g. sharded database • AP – System is still available under partitioning, but some of the data returned my be inaccurate – e.g. DNS, caches, Master/Slave replication – Need some conflict resolution strategy NoSQL Databases Mar 2010 #8
  • 9. … and also CAE • CAE trade-off (Amazon?) – Cost-efficiency – High Availability – Elasticity • Pick any two (Cost, Availability, Elasticity) – Cost Efficiency • By over-provisioning you get HA and E – High Availability • Just make the client wait when the system is overloaded (CE and E) – Elasticity • If you can predict your workload, you can provide HA + CE by booking the resources in advance • Everyone wants HA, so the challenge is providing E in a CE way NoSQL Databases Mar 2010 #9
  • 10. NoSQL taxonomy • Key-Value stores – Simple K/V lookups (DHT) • Column stores – Each key is associated with many attributes (columns) – NoSQL column stores are actually hybrid row/column stores • Different from “pure” relational column stores! • Document stores – Store semi-structured documents (JSON) – Map/Reduce based materialisation, sorting, aggregation, etc. • Graph databases – Not exactly NoSQL… • can’t satisfy the requirements for High Availability and Scalability/Elasticity very well NoSQL Databases Mar 2010 #10
  • 11. Contents Part II Key-Value stores PNUTS, Dynamo, Voldemort NoSQL Databases Mar 2010 #11
  • 12. PNUTS • Yahoo, ~2008, part of the Y! Sherpa platform • Requirements – Scale out / elasticity – Geo-replication – High availability / Fault tolerance – Relaxed consistency • Simplified relational data model – Flexible schema – No Referential Integrity – No joins, aggregation, etc – Updates/deletes only by Primary Key – “Multiget” (retrieve many records by PKs) NoSQL Databases Mar 2010 #12
  • 13. PNUTS – consistency model • Timeline consistency – In-between serializability and eventual consistency • All replicas apply updates for a record in the same order • but not necessarily at the same time (eventually) – One replica is always chosen as a Master (per record!) • adaptively changed for load re-balancing (usage pattern tracking!) • updates go only to the Master, then async propagation to other replicas / reads go to replicas or Master – Read consistency levels • Read any – may return a stale version of a record (any replica, fast) • Read critical (V) – return a newer (or the same) record version than V • Read latest – most recent version (goes to Master, slow) – Write consistency levels • Write • Test-and-set (V) – write only if the record version is the same as V (ensures serialization) NoSQL Databases Mar 2010 #13
  • 14. PNUTS – consistency model (2) (C) Yahoo! NoSQL Databases Mar 2010 #14
  • 15. PNUTS - architecture • Regions – (usually) geographically distributed replicas (20+ datacenters) – Pub/Sub asynchronous replication (Tribble) • Tablets – Horizontal partitioning of the tables (keyspace) – One copy of a Tablet within each Region • Components – Routers • Cached copy of the interval mapping for tablets – Storage Units (SU) • Filesystem or MySQL – Tablet Controllers (TC) • Maintains interval mapping, tablet split, load re-balancing, etc NoSQL Databases Mar 2010 #15
  • 16. PNUTS – architecture (2) (C) Yahoo! NoSQL Databases Mar 2010 #16
  • 17. PNUTS – architecture (3) • Message Broker – Pub/Sub replication – Notification service (even for external apps) – Reliable logging – 1 broker per tablet, but subscriptions on the table level possible too • “Mastership” – Record master – updates, “read latest” – Tablet master – inserts & deletes • Query processing – Replica – “read any”, “read critical” – Master – updates, “read latest” – Scatter/Gather engine – range queries, table scans, multi-record updates NoSQL Databases Mar 2010 #17
  • 18. Dynamo • Amazon, ~2007 • P2P key-value store – Object versioning – Consistent hashing – Gossip – membership & failure detection – Quorum reads • Requirements & assumptions – Simple query model (unique keys, blobs, no schema, no multi-access) – Scale out (elasticity) – Eventual consistency (improved availability) – Decentralized & symmetric (P2P), heterogeneous (load distribution) – Low latency / high throughput – Internal service (no security model) NoSQL Databases Mar 2010 #18
  • 19. Dynamo - partitioning • Consistent hashing – “Ring” of nodes (random position assigned to nodes when joining) – Node leaving/joining the ring affects only its 2 neighbours – Virtual nodes (improves the load balancing) – node is assigned several positions in the ring • Replication – N replicas (successor nodes on ring) • Versioning – All updates provide version context – Eventual consistency – Conflict reconciliation by clients (vector clocks) – Vector clock size thresholds (truncation) NoSQL Databases Mar 2010 #19 (C) Amazon
  • 20. Dynamo – operations • Read/Write requests – Client sends a request to a random node, the node forwards it to the proper node (1st replica responsible for that partition – coordinator) – Coordinator will forward R/W requests to top N replicas • Consistency – Quorum protocol • R, W – minimum number of nodes that must participate in a Read/ Write operation (R+W > N) – Writes • Coordinator generates the version and the vector clock • Coordinator sends requests to N replicas, if W replicas confirm then OK – Reads • Coordinator sends requests to N replicas, if R replicas respond then OK • If different versions are returned then reconcile and write back the reconciled version (read repair) NoSQL Databases Mar 2010 #20
  • 21. Dynamo – failure handling • “Hinted handoff” – Dynamo is “always writable” – If a replica node NR is down, the coordinator sends the write data to a new node NHH (with a hint that NR was the intended recipient) – Hinted writes are not readable, i.e. read quorum may still fail! – NHH will keep the data until NR is available again, then send the data version to NR and delete the local copy • “Anti-entropy” replica synchronisation – What if a hinted replica NHH fails before it can sync with NR? – Merkle trees • Pros – minimum data transfer (only the root node hash compared when in-sync), very fast location of out-of-sync keys (tree traversal) • Cons – trees need to be re-calculated when nodes join/leave the ring (keyspace partitioning changes) NoSQL Databases Mar 2010 #21
  • 22. Dynamo – architecture • Bootstrapping & node removal – New node gets a position in the ring – Several re-balancing strategies • Partition split - the node responsible for that keyspace partition splits it and sends half the data to the new node • Fixed size partitions – one or more partitions are sent to the new node – Similar process when a node leaves the ring • Node components – Request handling & coordination – Membership & failure detection – Local persistence engine • MySQL • BerkleyDB • In-memory buffer + persistence NoSQL Databases Mar 2010 #22
  • 23. Voldemort • http://project-voldemort.com/ • Distributed key-value store – Based on Dynamo • Originally developed by LinkedIn, now open source • Features – Simple data model (no joins or complex queries, no RI, …) – P2P – Scale-out / elastic • Consistent hashing of keyspace • Fixed partitions (no splits, but owner may change when re-balancing) – Eventual consistency / High Availability – Replication – Failure handling NoSQL Databases Mar 2010 #23
  • 24. Voldemort & Dynamo • Similar to Dynamo – Consistent hashing – Versions & Vector clocks -> eventual consistency – R/W quorum (R+W>N) – Hinted handoff (failure handling) – Pluggable persistence engines • MySQL, BerkeleyDB, in-memory NoSQL Databases Mar 2010 #24
  • 25. Contents Part III Column stores BigTable, HBase, Cassandra NoSQL Databases Mar 2010 #25
  • 26. BigTable • Google, ~2006 • Sparse, distributed, persistent multidimensional sorted map • (row, column, timestamp) dimensions, value is string • Key features – Hybrid row/column store – Single master (stand-by replica) – Versioning – Compression (C) Google NoSQL Databases Mar 2010 #26
  • 27. BigTable - data model • Row – Keys are arbitrary strings – Data is sorted by row key • Tablet – Row range is dynamically partitioned into tablets (sequence of rows) – Range scans are very efficient – Row keys should be chosen to improve locality of data access • Column, Column Family – Column keys are arbitrary strings, unlimited number of columns – Column keys can be grouped into families – Data in a CF is stored and compressed together (Locality Groups) – Access control on the CF level NoSQL Databases Mar 2010 #27
  • 28. BigTable - data model (2) • Timestamps – Each cell has multiple versions – Can be manually assigned • Versioning – Automated garbage collection • Retain last N versions • Retain versions newer than TS • Architecture – Data stored on GFS – Relies on Chubby (distributed lock service, Paxos) – 1 Master server – thousands of Tablet servers NoSQL Databases Mar 2010 #28
  • 29. BigTable - architecture • Master server – Assign tablets to Tablet Servers – Balance TS load – Garbage collection – Schema management – Client data does not move through the MS (directly through TS) – Tablet location not handled by MS • Tablet server (many) – thousands of tablets per TS – Manages Read / Write / Split of its tablets NoSQL Databases Mar 2010 #29
  • 30. BigData - workflow • Tablet location – B+ tree index – root (Chubby file) - location of the Root Metadata tablet – 1st level (Root Metadata tablet) – locations of all Metadata tablets – 2nd level (Metadata tablet) – location of a set of tablets • Each row is SSTable ID + end row ID – Clients cache tablet locations (C) Google NoSQL Databases Mar 2010 #30
  • 31. BigData – workflow (2) • Tablet access – SSTable – persistent, ordered immutable map from keys to values • Contains a sequence of blocks + block index • Compression at block level • 2 phase compression, 10:1 ratio (standard gzip up to 3:1) – Updates • 1st added to the redo log • Then to the in-memory buffer (memtable) • If memtable is full, then it is flushed into a new SSTable and emptied – Reads • Executed over the merged memtable and all SSTables (Bloom filters) • Merged view is very efficient (data is already sorted) • Bloom filters – probabilistic maps on whether a specific SSTable contains a specific cell (row, column), i.e. disk lookups are drastically reduced NoSQL Databases Mar 2010 #31
  • 32. BigData – workflow (3) (C) Google NoSQL Databases Mar 2010 #32
  • 33. BigData – workflow (4) • SSTable compactions – Minor compaction • Memtable flushed on disk into a new SSTable and emptied • When: memtable is full – Merging compaction • Merge the memtable and several SSTables into 1+ new SSTable(s) • Then, empty the memtable and delete the input SSTables • New SSTable(s) can still contain deletion info + deleted data (no purging) • When: number of SSTables limit reached – Major compaction • A merging compaction with exactly 1 output SSTable and all deleted data purged • When: regularly (background) NoSQL Databases Mar 2010 #33
  • 34. BigTable - performance • Benchmark with 1, 50, 250 and 500 Tablet Servers (C) Google NoSQL Databases Mar 2010 #34
  • 35. BigTable, real world data volumes (ca. 2006) (C) Google NoSQL Databases Mar 2010 #35
  • 36. HBase • http://hadoop.apache.org/hbase • Developed by Powerset, now Apache • Based on BigTable – HDFS (GFS), ZooKeeper (Chubby) – Master Node (Master Server), Region Servers (Tablet Servers) – HStore (tablet), memcache (memtable), MapFile (SSTable) • Features – Data is stored sorted (no real indexes) – Automatic partitioning – Automatic re-balancing / re-partitioning – Fault tolerance (HDFS, 3 replicas) NoSQL Databases Mar 2010 #36
  • 37. HBase - architecture (C) Tom White NoSQL Databases Mar 2010 #37
  • 38. HBase - operations • Workflow – Hstore/tablet location (== BigTable) • clients talk to ZooKeeper • root metadata, metadata tables, client side caching, etc – Writes (== BigTable) • First added to the commit log, then to the memcache (memtable), eventually flushed to MapFile (SSTable) – Reads (== BigTable) • First read from memcache, if info is incomplete then read the MapFiles – Compactions (== BigTable) • APIs – Thrift – REST • GET | PUT | POST | DELETE <table>/<row>/<column>:<qualifier>/<timestamp> NoSQL Databases Mar 2010 #38
  • 39. HBase – MapReduce jobs • Can be a source/sink for M/R jobs in Hadoop (C) ??? NoSQL Databases Mar 2010 #39
  • 40. HBase vs. BigTable • http://www.larsgeorge.com/2009/11/hbase-vs-bigtable- comparison.html Feature BigTable HBase Access control + - Master server single single/multiple* Locality groups + - Cell cache + - Commit logs (WAL) primary + secondary single Skip WAL for bulk loads ? yes Data isolation + - replication + In progress Everything else pretty much the same! NoSQL Databases Mar 2010 #40
  • 41. Cassandra • http://cassandra.apache.org/ • Developed by Facebook (inbox), now Apache – Facebook now developing its own version again • Based on Google BigTable (data model) and Amazon Dynamo (partitioning & consistency) • P2P – Every node is aware of all other nodes in the cluster • Design goals – High availability – Eventual consistency (improves HA) – Incremental scalability / elasticity – Optimistic replication NoSQL Databases Mar 2010 #41
  • 42. Cassandra – data model, partitioning • Data model – Same as BigTable – Super Columns (nested Columns) and Super Column Families – column order in a CF can be specified (name, time) • Cluster membership – Gossip – every nodes gossips to 1-3 other nodes about the state of the cluster (merge incoming info with its own) – Changes in the cluster (node in/out, failure) propagate quickly (LogN) – Probabilistic failure detection (sliding window, Exp(α) or Nor(μ,σ2)) • Dynamic partitioning – Consistent hashing – Ring of nodes – Nodes can be “moved” on the ring for load balancing NoSQL Databases Mar 2010 #42
  • 43. Cassandra - operations • Writes (== BigTable… almost) – Client sends request to a random node, the node determines the actual node responsible for storing the data (cluster awareness) – Data replicated to N nodes (config) – Data first added to the commit log, then to the memtable, eventually flushed to SSTable • Reads (== BigTable… almost) – Send request to a random node, which forwards it to all the N nodes having the data • Single read – return first response / Quorum read – value that the majority of nodes agrees on – First read from memtable, if info is incomplete then read the SSTables – Bloom filters for efficient SSTable lookups • Compactions (== BigTable) Databases NoSQL Mar 2010 #43
  • 44. Cassandra @ Facebook • Inbox search • ca. 2009 - 50 TB data, 150 nodes, 2 datacenters • Performance (production) (C) facebook NoSQL Databases Mar 2010 #44
  • 45. Contents Part IV Document stores CouchDB NoSQL Databases Mar 2010 #45
  • 46. CouchDB • http://couchdb.apache.org/ , ~2005 • Schema-free, document oriented database – Documents stored in JSON format (XML in old versions) – B-tree storage engine – MVCC model, no locking – no joins, no PK/FK (UUIDs are auto assigned) – Implemented in Erlang • 1st version in C++, 2nd in Erlang and 500 times more scalable (source: “Erlang Programming” by Cesarini & Thompson) – Replication (incremental) { • Documents “name”: “Ontotext”, “url”: “www.ontotext.com” – UUID, version “employees”: 40 – Old versions retained “products”: [“OWLIM”, “KIM”, “LifeSKIM”, “JOCI”] } NoSQL Databases Mar 2010 #46
  • 47. CouchDB (2) • REST API CRUD HTTP params Create PUT /db/docid Read GET /db/docid Update POST /db/docid Delete DELETE /db/docid • Views – Filter, sort, “join”, aggregate, report – Map/Reduce based, language independent (JavaScript by default) – K/V pairs from Map/Reduce are also stored in the B-tree engine – Built on demand – Can be materialized & incrementally updated NoSQL Databases Mar 2010 #47
  • 48. CouchDB – views (C) Chris Anderson / Apache NoSQL Databases Mar 2010 #48
  • 49. Contents Part V Benchmarks NoSQL Databases Mar 2010 #49
  • 50. NoSQL feature matrix (incomplete) (C) R. Ramakrishnan / Yahoo NoSQL Databases Mar 2010 #50
  • 51. CAP positioning Consistency •BigTable •HBase •Dynamo Partition tolerance •PNUTS Availability •Cassandra •Voldemort •CouchDB? NoSQL Databases Mar 2010 #51
  • 52. Yahoo Cloud Serving Benchmark (YCSB) • Benchmark various NoSQL systems – Performance • Measure latency/throughput curve on a fixed hardware – Scale out (elasticity) • add hardware, increase data size / workload proportionally • Measure latency (best case: constant) • Or… add hardware but same data/workload (latency should drop) – Later – replication, availability, … • Testbed – 6 boxes {2x4 core 2.5GHz, 8GB RAM, 6 HDD (SAS RAID1+0), GB eth} – 120M records (1K) ~ 20GB data per server – 100+ client threads NoSQL Databases Mar 2010 #52
  • 53. YCSB – performance • Update heavy workload (50/50 reads/updates) • Reminders – Cassandra (Dynamo) is “always writable” – HBase buffers updates in memory – PNUTS uses MySQL as a persistence engine NoSQL Databases (C) B. Cooper et al. / Yahoo #53 Mar 2010
  • 54. YCSB – performance (2) • Read heavy workload • Reminders – Cassandra (Dynamo) will read several replicas for read consistency – HBase will access several files during a read (even with Bloom filters) – PNUTS uses MySQL as a persistence engine NoSQL Databases (C) B. Cooper et al. / Yahoo #54 Mar 2010
  • 55. YCSB – scalability • Can more servers handle proportionally bigger workload? – Best case: latency should be constant • Notes – PNUTS and Cassandra scale very well – HBase performance varies a lot (C) B. Cooper et al. / Yahoo NoSQL Databases Mar 2010 #55
  • 56. YCSB – elastic speedup • What is the impact of adding one server (same workload)? – Best case: latency should drop • Notes – 3 servers, add 1 more at 5th min (different latency scales!) – Cassandra – performance improvement by 10% (quickly!) – Hbase – large latency spike initially, then 10% performance improvement (within 20 min) (C) B. Cooper et al. / Yahoo NoSQL Databases Mar 2010 #56
  • 57. Contents Part VI RDF & NoSQL? NoSQL Databases Mar 2010 #57
  • 58. RDF & NoSQL • No easy match – Partitioning by key not so suitable for RDF (S, P, O) – Data locality insufficient for RDF materialisation – Full featured query language (SPARQL) cannot be efficiently mapped to NoSQL key/value lookups and range scans – Eventual consistency is a problem (materialisation with stale data) • So far RDF limited to the graph databases subset of NoSQL – But graph databases don’t provide massive scalability NoSQL Databases Mar 2010 #58
  • 59. Useful links • http://groups.google.com/group/nosql-discussion • “Bigtable: A Distributed Storage System for Structured Data” • “Cassandra - A Decentralized Structured Storage System” • "Dynamo: Amazon’s Highly Available Key-value Store" • “PNUTS: Yahoo!'s Hosted Data Serving Platform” • “Benchmarking Cloud Serving Systems with YCSB” • “HBase vs. BigTable Comparison” NoSQL Databases Mar 2010 #59
  • 60. Q&A Questions? NoSQL Databases Mar 2010 #60