Your SlideShare is downloading. ×
0
Database         민형기(S-Core)  hg.min@samsung.com            2013. 2. 22.
ContentsI. NoSQL 개요II. NoSQL 기본개념III.NoSQL 관련논문IV.NoSQL 종류V. NoSQL 성능측정VI.NoSQL 정리                   1
NoSQL 개요           2
Thinking – Extreme DataQcon London 2012           3
Organizations need deeper insightsQcon London 2012                      4
New Trendshttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when   5
NoSQL 정의 비관계형, 분산, 오픈소스, 수평 확장성을 주요 특징으로 갖는 차세대 데이터베이스                                                 – nosql-database.or...
NoSQL 기술 현황               Gartner’s 2012 Hype Cycle for Big Data출처 : Gartner                                            7
NoSQL 유래   □ 1998년: Carlo Strozzi가 SQL interface를 지원하지 않은 가벼운(lightweight)     오픈소스 관계형DB를 NoSQL이라 정의          시스템 구조를 단순...
Why NoSQL?  □ACID doesn’t scale well  □Web apps have diffent needs (than the apps that RDBMS   were designed for)       L...
관계형 데이터베이스의 문제 I – 확장성 문제□ Replication - 복제에 의한 확장   Master-Slave 구조       결과를 슬레이브의 개수만큼 복제함. 특정 시점이 지나면 한계가 됨       읽...
관계형 데이터베이스의 문제 I – 확장성 문제http://www.slideshare.net/quipo/nosql-databases-why-what-and-when   11
관계형 데이터베이스의 문제 II – 필요없는 특징들□ UPDATE와 DELETE   정보의 손실이 발생하기 때문에 잘 사용되지 않음       Auditing이나 re-activation을 위해서 기록이 필요함   ...
관계형 데이터베이스의 문제 III – 필요없는 특징들□ ACID 트랜잭션   Atomic(원자성): 여러 레코드를 수정할 때 원자성은 필요 없으며 단일키 원자성이면 충분   Consistency(일관성): 대부분의 ...
NoSQL Features    □Scale horizontally “simple Operations”       Key lookups, reads and writes of one record or a small   ...
NoSQL Use Cases □Massive data Volumes   Massively distributed architecture required to store the data   Google, Amazon, ...
NoSQL 기본개념    CAP 정리    ACID vs. BASE    Isolation Levels    MVCC    Distributed Transaction                         ...
CAP 정리 I 2000 Prof. Eric Brewer PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2) □분산 시스템이 ...
CAP 정리 II                              Availability                관계형                                            DHT 기반  ...
CAP 정리 III – Partition Tolerence vs. Availability     "The network will be allowed to loss arbitrarily many messages sent ...
ACID Atomicity: All or nothing. Consistency: Consistent state of data and transactions. Isolation: Transactions are iso...
BASE – ACID alternative Basically available: Nodes in the a distributed  environment can go down, but the whole system  s...
ACID vs. BASE               ACID                         BASE    □Strong Consistency           □Weak Consistency    □Isola...
Isolation Levels     □Read Uncommitted                   aka (NOLOCK)                   Does not issue shared lock, does...
Isolation Levels     □Repeatable Read                   Locks data being read, prevents updates/deletes                 ...
Dirty Readshttp://www.slideshare.net/ErnestoHernandezRodriguez/transaction-isolation-levels   25
Non Repeatable Readshttp://www.slideshare.net/ErnestoHernandezRodriguez/transaction-isolation-levels   26
Phantom Readshttp://www.slideshare.net/ErnestoHernandezRodriguez/transaction-isolation-levels   27
Isolation Levels vs. Reads Phenoma                                                                Non-repeatable   Phantom...
Isolation Levels vs. Locks         Isolation Level                         Range Lock     Read Lock   Write Lock    Read U...
Multi Version Concurrency Control                                                                   Root                  ...
Multi Version Concurrency Control          obsolete                                                                   Root...
Distributed Transactions – 2PC    □Voting Phase: each site is polled as to whether a     transactions should commit (ie: w...
Distributed Transactions –2PC Protocol Actionshttp://www.slideshare.net/atali/2011-db-distributed   33
Distributed Transactions –2PC Protocol Actionshttp://www.slideshare.net/j_singh/cs-542-concurrency-control-distributed-com...
NoSQL 관련논문  Amazon Dynamo  Google BigTable                     35
Amazon Dynamo   Motivation   Key Features   Consistent Hashing   Virtual Nodes   Vector Clocks   Gossip Protocols & ...
Amazon Dynamo - Motivation     □Vast Distributed System             Tens of millions of customers             Tens of th...
Key Features 1/2     □Amazon, ~2007     □Highly-available key-value storage system              99.9995% of request      ...
Key Features 2/2     □Gossip protocol for:              Failure detection              Membership protocol     □Service ...
Techniques used in Dynamo          Consistent Hashing          Vector Clocks          Gossip Protocols          Hinted...
Modulo-based Hashing                        N1                                     N2         N3    N4                    ...
Modulo-based Hashing                        N1                                     N2        N3        N4                 ...
Consistent Hashing                              2128                0                                                     ...
Consistent Hashing                              2128                0                                                     ...
Consistent Hashing - Replication                              2128                0                                       ...
Consistent Hashing – Node Changes                              2128                0                                      ...
Virtual Nodes     Random assignment leads to:        Non-uniform data        Uneven load distribution     Solution: “v...
Virtual Nodes – Load Distribution                              2128                0                                      ...
Virtual Nodes – Load Distributionhttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when   49
Replication & Consistency (Quorum)N = number of nodes with a replica of the dataW = number of replicas that must acknowled...
Vector Clocks & Conflict Detection                                             Causality-based partial                    ...
Vector Clocks & Conflict Detection                                            Vector Clocks can detect                    ...
Gossip Protocol + Hinted Handoff                                            A             F                               ...
Gossip Protocol + Hinted Handoff                                                   A             F                        ...
Gossip Protocol + Hinted Handoff                                                        My canonical node is              ...
Merkle Trees (Hash Trees)                                         Leaves: hashes of                                       ...
Merkle Trees (Hash Trees)                    Node A                                                                       ...
Read Repair                                            A             F                                                    ...
Read Repair                                                    K=XYZ(v.2)                                            A    ...
Read Repair                                            A             F                                                    ...
Google Bigtable   Motivation   Key Features   System Architecture   Building Blocks   Data Model   SSTable/Tablet/Ta...
Motivation      □Lots of (semi-)structured data at Google               URLs:                      Contents crawl metada...
Key Features      □Google, ~2006      □Distributed multi-level map      □Fault-tolerant, persistent      □Scalable        ...
System Architecturehttp://www.slideshare.net/kingherc/bigtable-and-dynamo   64
Building Blocks      □Building blocks:              Google File System (GFS): Raw storage              Scheduler: Google...
Google File System      □Large-scale distributed “filesystem”      □Master: responsible for metadata      □Chunk servers: ...
Chubby      □Distributed Lock Service      □File System {directory/file}, for locking                    Coarse-grained l...
Data model      □“Sparse, distributed, persistent, multidim. sorted map”      □<Row, Column, Timestamp> triple for key - l...
SSTable      □Immutable, sorted file of key-value pairs      □Chunks of data plus an index              Index is of block...
Tablet      □Contains some range of rows of the table                    Dynamically partitioned range of rows      □Buil...
Table      □Multiple tablets(table segments) make up the table      □SSTables SSTables can be shared can be shared      □T...
Tablets & Splitting      Large tables broken into tablets at row boundaries                                               ...
Finding a Tablet     Approach: 3-level hierarchical lookup scheme for tablets        – Location is ip:port of relevant ser...
Servers Tablet servers manage tablets, multiple tablets per  server. Each tablet is 100-200 megs  –Each tablet lives at o...
BigTable I/O                                        memtable                                 read                         ...
Compactions Minor compaction – convert the memtable  into an SSTable  Reduce memory usage  Reduce log traffic on restar...
Locality Groups Group column families together into an  SSTable  –Avoid mingling data, ie page contents and page   metada...
NoSQL 종류   Key-value Stores   Column Database   Document Database   Graph Database                        78
NoSQL 분류표http://nosql.mypopescu.com/post/2335288905/nosql-databases-genealogy   79
NoSQL Landscape451 group          80
NoSQL 종류 I   □Key-Value Stores    □Based on DHTs / Amazon’s Dynamo paper    □Data model: (global) collection of K-V pairs ...
NoSQL 종류 II   □Document Store    □Inspired by Lotus Notes    □Data model: collections of K-V collections    □Example: Couc...
NoSQL Data Model 비교http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/   83
Key-value Stores    Voldemort    Riak    Redis                   84
Voldemort                                                                                       AP   □ LinkedIn, 2009, Apa...
Voldemort Pros / Cons                                                                                  AP       Pros      ...
Voldemort : Logical Architecture                                                                                          ...
Voldemort: Physical Architecture Options               APhttp://www.project-voldemort.com/voldemort/design.html        88
Riak                                                                                                                      ...
Riak Pros / Cons                                                                                                 AP       ...
Riak                                                                                    AP                                ...
Riak Logical Architecture                                                  APhttp://bcho.tistory.com/621, http://basho.com...
Redis                                                                                                                     ...
Redis Pros / Cons                                                                                         CP       Pros   ...
Redis                                                                                                                     ...
Column Families Stores       Cassandra       HBase                         96
Cassandra                                                             AP   □ ASF, 2008, Apache 2.0, Java   □ Model: Column...
Cassandra Pros / Cons                                                                                      AP       Pros  ...
Cassandra                                                           AP   Data model of BigTable, infrastructure of Dynamo ...
Cassandra                                                           AP   Data model of BigTable, infrastructure of Dynamo ...
Cassandra                                                           AP   Data model of BigTable, infrastructure of Dynamo ...
Cassandra                                                                                                   AP   Data mode...
Cassandra                                                                                                       AP   Data ...
Cassandra - Data Model                                                               AP                                   ...
Cassandra - Data Model                                                                                                  AP...
Cassandra – Use Case: eBay                                          AP   □ A glimpse on eBay’s Cassandra deployment      □...
HBase                                                                                                CP   □ ASF, 2010, Apa...
HBase Pros / Cons                                                                                        CP       Pros    ...
HBase vs. BigTable Terminology                                                                CP                          ...
HBase: Architecture                                                                                                       ...
HBase: Architecture - WAL                                                                                                 ...
HBase: Architecture - WAL                                                                                                 ...
HBase -                   Usecase: Facebook Message Service                                                            CP ...
Hbase –                  Usecase: Adobe                                         CP   □ When we started pushing 40 million ...
Document Stores    MongoDB    CouchDB                  115
MongoDB                                                                                              CP   □ 10gen, 2009, A...
MongoDB Pros / Cons                                                                                       CP       Pros   ...
MongoDB                                                                          CP                                       ...
MongoDB - Architecture                                                                                            CP   •  ...
CouchDB                                                                                               AP   □ ASF, 2005, Ap...
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
NoSQL Database
Upcoming SlideShare
Loading in...5
×

NoSQL Database

10,375

Published on

NoSQL Database

0 Comments
33 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
10,375
On Slideshare
0
From Embeds
0
Number of Embeds
1
Actions
Shares
0
Downloads
284
Comments
0
Likes
33
Embeds 0
No embeds

No notes for slide

Transcript of "NoSQL Database"

  1. 1. Database 민형기(S-Core) hg.min@samsung.com 2013. 2. 22.
  2. 2. ContentsI. NoSQL 개요II. NoSQL 기본개념III.NoSQL 관련논문IV.NoSQL 종류V. NoSQL 성능측정VI.NoSQL 정리 1
  3. 3. NoSQL 개요 2
  4. 4. Thinking – Extreme DataQcon London 2012 3
  5. 5. Organizations need deeper insightsQcon London 2012 4
  6. 6. New Trendshttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 5
  7. 7. NoSQL 정의 비관계형, 분산, 오픈소스, 수평 확장성을 주요 특징으로 갖는 차세대 데이터베이스 – nosql-database.org□ NO! SQL Not Only SQL  관계형 데이터베이스의 한계를 극복하기 위한 데이터 저장소의 새로운 형태  최근에는 빅데이터 처리 & 분산시스템 문제에 집중! Google Trends - nosql 6
  8. 8. NoSQL 기술 현황 Gartner’s 2012 Hype Cycle for Big Data출처 : Gartner 7
  9. 9. NoSQL 유래 □ 1998년: Carlo Strozzi가 SQL interface를 지원하지 않은 가벼운(lightweight) 오픈소스 관계형DB를 NoSQL이라 정의  시스템 구조를 단순화 시킴  시스템을 이기종 장비 에도 이식시킬 수 있게 함  쉘과 같은 툴로 임의의 UNIX 환경에서 구동시킬 수 있음  표준 상용 제품보다 기능은 줄이면서 가격은 저렴하게 함  데이터 필드 크기, 컬럼 등의 제약을 없앰 □ 2009년: 랙스페이스 직원인 Eric Evans가 오픈소스, 분산, 비관계형DB 이벤트에서 NoSQL을 재언급함. 기존 관계형 DBMS와 다른 특징으로 규정함  비관계형 (Non-relational)  분산 (Distributed)  ACID 미지원 □ 2011년: UnQL(Unstructurred Query Language) 활동 시작출처 : Wikipedia, Dataversity 8
  10. 10. Why NoSQL? □ACID doesn’t scale well □Web apps have diffent needs (than the apps that RDBMS were designed for)  Low and predictable response time(latency)  Scalability & Elasticity (at low cost!)  High Availability  Flexible schema / semi-structured data  Geographic distribution (multiple datacenters) □Web apps can (usually) do not  Transaction / strong consistency / integrity  Complex queueshttp://www.slideshare.net/marin_dimitrov/nosql-databases-3584443 9
  11. 11. 관계형 데이터베이스의 문제 I – 확장성 문제□ Replication - 복제에 의한 확장  Master-Slave 구조  결과를 슬레이브의 개수만큼 복제함. 특정 시점이 지나면 한계가 됨  읽기(Read)는 빠르지만 쓰기(Write)는 하나의 노드에 대해서만 일어나기 때문에 병목 현상이 발생함  Master에서 slave로 퍼지는데 시간이 소요되기 때문에 중요한(Critical)한 읽기는 여전 히 Master에서 읽어야 하고, 이것은 어플리케이션 개발에 고려가 필요함  데이터 규모가 큰 경우에는 N번 복제를 해야 하기 때문에 문제가 발생할 소지가 있음. 이것은 Master-Slave 방식으로 확장성에 대한 제한을 가지게 됨  Master-Master 구조  Master를 추가함으로써 쓰기성능을 향상할 수 있으나 충돌이 발생할 가능성이 있음  충돌 가능성은 O(N3) 또는 O(N2) 에 비례함 http://research.microsoft.com/~gray/replicas.ps□ Partitioning(Sharding) - 분할에 의한 확장  Read만큼 Write도 확장할 수 있지만 애플리케이션에서 파티션된 것을 인지하고 있어야 함  RDBMS의 가치는 관계에 있다고 할 수 있는데 파티션을 하면 이 관계가 깨져버리고 각 파 티션된 조각간에 조인을 할 수 없기 때문에 관계에 대한 부분은 애플리케이션 레이어에서 책임져야 합니다.  일반적으로 RDBMS에서 수동 Sharding 은 쉽지 않다. 10
  12. 12. 관계형 데이터베이스의 문제 I – 확장성 문제http://www.slideshare.net/quipo/nosql-databases-why-what-and-when 11
  13. 13. 관계형 데이터베이스의 문제 II – 필요없는 특징들□ UPDATE와 DELETE  정보의 손실이 발생하기 때문에 잘 사용되지 않음  Auditing이나 re-activation을 위해서 기록이 필요함  일반적으로 도메인 관점에서 삭제(deleted)나 갱신(update)는 사용되지 않음  UPDATE나 DELETE는 INSERT와 version으로 모델 할 수 있음  데이터가 많아지면 비활성(inactive) 데이터는 archive함  INSERT-only 시스템에서는 2개의 문제 존재  데이터베이스에서 종속(cascade)에 대한 트리거를 이용할 수 없음  Query가 비활성 데이터를 걸러내야 할 필요가 있음□ JOIN  피해야 하는 이유: 데이터가 많을 때 JOIN은 많은 양의 데이터에 복잡한 연산을 수행해야 하기 때문에 비용이 많이 들며 파티션을 넘어서는 동작하지 않기 때문  피하는 방법  정규화의 목적: 일관된 데이터를 가지기 쉽게 하고 스토리지의 양을 줄이기 위함  반정규화(de-normalization)를 하면 JOIN 문제를 피할 수 있음. 반정규화로 일관성에 대한 책임을 DB에서 어플리케이션으로 이동시킬 수 있는데 이는 INSERT-only라면 어 렵지 않음 12
  14. 14. 관계형 데이터베이스의 문제 III – 필요없는 특징들□ ACID 트랜잭션  Atomic(원자성): 여러 레코드를 수정할 때 원자성은 필요 없으며 단일키 원자성이면 충분  Consistency(일관성): 대부분의 시스템은 C보다는 P나 A를 필요로 하기 때문에 엄격한 일관 성을 가질 필요는 없고 대신 결과적 일관성(Eventually Consistent)을 가질 수 잇음  Isolation(격리성): Read-Committeed 이상의 격리성은 필요하지 않으며 단일키 원자성이 더 쉽다.  Durability(지속성): 각 노드가 실패했을 때도 이용되기 위해서는 메모리가 데이터를 충분히 보관할 수 있을 정도로 저렴해지는 시점까지는 지속성이 필요함□ 고정된 스키마(Fixed Schema)  RDBMS에서는 데이터를 사용하기 전에 스키마를 정의해야 함: Table, Index등을 정의해야 하는데  스키마 수정은 기본: 현재의 웹환경에서는 빠르게 새로운 피쳐를 추가하고 이미 존재하는 피쳐를 조정하기 위해서는 스키마 수정이 필수적으로 요구됨  스키마 수정의 어려움: 컬럼의 추가/수정/삭제는 row에 lock을 걸고 index의 수정은 테이블 에 lock을 걸기 때문□ 일부 없는 특성  계층화나 그래프를 모델하는 것은 어려움  빠른 응답을 위해서 디스크를 피하고 메인 메모리에서 데이터를 제공하는 것이 바람직한데 대부분의 관계형 데이터베이스는 디스크기반이기 때문에 쿼리들이 디스크에서 수행됨 13
  15. 15. NoSQL Features □Scale horizontally “simple Operations”  Key lookups, reads and writes of one record or a small number of records, simple selections □Replicate/distribute data over many servers □Simple call level interface (constrast w/SQL) □Weaker concurrency model than ACID  Eventual Consistency  BASE □Efficient use of distributed indexes and RAM □Flexible Schemahttp://www.cs.washington.edu/education/courses/cse444/12sp/lectures/lecture26-nosql.pdf 14
  16. 16. NoSQL Use Cases □Massive data Volumes  Massively distributed architecture required to store the data  Google, Amazon, Yahoo, Facebook – 10K ~ 100K servers □Extremely query workload  Impossible to efficiently do joins at the scale with an RDBMS □Schema evolution  Schema flexibility(migration) is not trivial at large scale  Schema changes can be gradually introduced with NoSQL 15
  17. 17. NoSQL 기본개념  CAP 정리  ACID vs. BASE  Isolation Levels  MVCC  Distributed Transaction 16
  18. 18. CAP 정리 I 2000 Prof. Eric Brewer PoDC Conference Keynote 2002 Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2) □분산 시스템이 보장해야 할 3가지 특성 □ Consistency: 각각의 사용자가 항상 동일한 데이터를 조회한다. □ Availability: 모든 사용자가 항상 읽고 쓸 수 있다. □ Partition tolerance: 물리적 네트워크 분산 환경에서 시스템이 잘 동작한다. □분산 시스템에서는 적절한 시간에 2가지 특성만 만족할 수 있 다. 17
  19. 19. CAP 정리 II Availability 관계형 DHT 기반 분산환경에서 적절한 Dynamo, Cassandra RDBMS 응답시간 이내에 세가지 속성을 만족시키는 저장소는 구성하기 어렵다. Partition Consistency Tolerence 파티셔닝 기반 BigTable, Hbase, MongoDB 분산 시스템에서의 네트워크 분할은 반드시 대비해야 한다. 따라서 실제로는 어떤 것을 포기할지에 대해 두 개의 선택권만 있다. – Werner Vogels (아마존 CTO) http://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-SigAct.pdf 18
  20. 20. CAP 정리 III – Partition Tolerence vs. Availability "The network will be allowed to loss arbitrarily many messages sent from one node to another"[...]" "For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response“ - Gilbert and Lynch, SIGACT 2002  CP: request can complete at nodes that have quoram  AP: requests can complete at any live node, possibly violating strong consistencyhttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 19
  21. 21. ACID Atomicity: All or nothing. Consistency: Consistent state of data and transactions. Isolation: Transactions are isolated from each other. Durability: When the transaction is committed, state will be durable. Any data store can achieve Atomicity, Isolation and Durability but do you always need consistency? No. By giving up ACID properties, one can achieve higher performance and scalability. 20
  22. 22. BASE – ACID alternative Basically available: Nodes in the a distributed environment can go down, but the whole system shouldn’t be affected. Soft State (scalable): The state of the system and data changes over time, even without input. This is because of the eventual consistency model. Eventual Consistency: Given enough time, data will be consistent across the distributed system. 21
  23. 23. ACID vs. BASE ACID BASE □Strong Consistency □Weak Consistency □Isolation □Availability first □Focus on “commit” □Best effort □Nested transactions □Approximated answers □Less Availability □Aggressive(optimistic) □Conservative (pessimistic) □Simpler! □Difficult evolution (e.g. □Faster schema) □Easier evolution출처 : Brewer 22
  24. 24. Isolation Levels □Read Uncommitted  aka (NOLOCK)  Does not issue shared lock, does not honor exclusive lock  Rows can be updated/inserted/deleted before transaction ends  Least restrictive □Read Committed  Holds shared Lock  Cannot read uncommitted data, but data can be changed before end of transaction, resulting in non repeatable read or phantom rowshttp://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels 23
  25. 25. Isolation Levels □Repeatable Read  Locks data being read, prevents updates/deletes  New rows can be inserted during transaction, will be included in later reads □Serializable  Aka HOLDLOCK on all tables in SELECT  Locks range of data being read, no modifications are possible  Prevents updates/deletes/inserts  Most restrictivehttp://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels 24
  26. 26. Dirty Readshttp://www.slideshare.net/ErnestoHernandezRodriguez/transaction-isolation-levels 25
  27. 27. Non Repeatable Readshttp://www.slideshare.net/ErnestoHernandezRodriguez/transaction-isolation-levels 26
  28. 28. Phantom Readshttp://www.slideshare.net/ErnestoHernandezRodriguez/transaction-isolation-levels 27
  29. 29. Isolation Levels vs. Reads Phenoma Non-repeatable Phantom Isolation Level Dirty Reads Reads Reads Read Uncommitted Read Committed - Repeatable Read - - Serializable - - -http://en.wikipedia.org/wiki/Isolation_(database_systems) 28
  30. 30. Isolation Levels vs. Locks Isolation Level Range Lock Read Lock Write Lock Read Uncommitted - - - Read Committed Exclusive Shared - Repeatable Read Exclusive Exclusive - Serializable Exclusive Exclusive Exclusivehttp://en.wikipedia.org/wiki/Isolation_(database_systems) 29
  31. 31. Multi Version Concurrency Control Root Index Index Index Index Index Index Index Index Index Index Index Data Data Data Data Data Data Data Data Data Data Data Data Data Datahttp://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels 30
  32. 32. Multi Version Concurrency Control obsolete Root atomic pointer update new version marked for compaction Index Index Reads: never Index Index Index Index blocked Index Index Index Index Index Index Index Index Data Data Data Data Data Data Data Data Data Data Data Data Data Data Datahttp://www.adayinthelifeof.nl/2010/12/20/innodb-isolation-levels 31
  33. 33. Distributed Transactions – 2PC □Voting Phase: each site is polled as to whether a transactions should commit (ie: whether their sub- transaction can commit) □Decision Phase: if any site says “abort” or does not reply, then all sites must be told to abort □Logging is performed for failure recovery (as usual)http://www.slideshare.net/atali/2011-db-distributed 32
  34. 34. Distributed Transactions –2PC Protocol Actionshttp://www.slideshare.net/atali/2011-db-distributed 33
  35. 35. Distributed Transactions –2PC Protocol Actionshttp://www.slideshare.net/j_singh/cs-542-concurrency-control-distributed-commit 34
  36. 36. NoSQL 관련논문  Amazon Dynamo  Google BigTable 35
  37. 37. Amazon Dynamo Motivation Key Features Consistent Hashing Virtual Nodes Vector Clocks Gossip Protocols & Hinted Handoffs Read Repair 36
  38. 38. Amazon Dynamo - Motivation □Vast Distributed System  Tens of millions of customers  Tens of thousands of servers  Failure is a normal case □Outage means  Lost Customer Trust  Financial loses □Goal: great customer experience  Always Available  Fast  Reliable  Scalablehttp://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf 37
  39. 39. Key Features 1/2 □Amazon, ~2007 □Highly-available key-value storage system  99.9995% of request  Targeted for primary key access and small values(< 1MB) □Scalable and decentralized □Gives tight control over tradeoffs between:  Availability, consistency, performance □Data partitioned using consistent hashing □Consistency facilitated by object versioning  Quorum-like technique for replicas consistency  Decentralized replica synchronization protocol  Eventual consistencyhttp://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf, http://www.slideshare.net/kingherc/bigtable-and-dynamo 38
  40. 40. Key Features 2/2 □Gossip protocol for:  Failure detection  Membership protocol □Service Level Agreements(SLAs)  Include client’s expected request rate distribution and expected service latency  e.g.: Response time < 300ms, for 99.9% of requests, for a peak load of 500 requests/sec.  Example: Managing shopping carts. Write /read and available across multiple data centers. □Trusted network, no authentication □Incremental scalability □Symmetry □Heterogeneity, Load distributionhttp://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf, http://www.slideshare.net/kingherc/bigtable-and-dynamo 39
  41. 41. Techniques used in Dynamo  Consistent Hashing  Vector Clocks  Gossip Protocols  Hinted Handoffs  Read Repair  Merkle Trees Problem Technique Advantage Partitioning Consistent Hashing Incremental Scalability High Availability for Vector clocks with reconciliation during Version size is decoupled from update rates. writes reads Handling temporary failures Provides high availability and durability guarantee Sloppy Quorum and hinted handoff when some of the replicas are not available. Recovering from Anti-entropy using Merkle trees Synchronizes divergent replicas in the background. permanent failures Preserves symmetry and avoids having a centralized Membership and failure Gossip-based membership protocol and registry for storing membership and node liveness detection failure detection. information.http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf 40
  42. 42. Modulo-based Hashing N1 N2 N3 N4 ? partition = key % n_servershttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 41
  43. 43. Modulo-based Hashing N1 N2 N3 N4 ? partition = key % (n_servers – 1) Recalculate the hashes for all entries if n_servers changes (i.e. full data redistribution when adding/removing a node)http://www.slideshare.net/quipo/nosql-databases-why-what-and-when 42
  44. 44. Consistent Hashing 2128 0 hash(key) A Same hash function F for data and nodes B idx = hash(key) Ring (Key Space) Coordinator: next E available clockwise node C Dhttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 43
  45. 45. Consistent Hashing 2128 0 hash(key) A Same hash function F for data and nodes B idx = hash(key) Ring (Key Space) E Coordinator: next available clockwise node C Dhttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 44
  46. 46. Consistent Hashing - Replication 2128 0 KeyAB hosted A in B, C, D F Data replication in B the N-I clockwise Ring successor nodes (Key Space) E C Node hosting KeyFA, KeyAB, KeyBC Dhttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 45
  47. 47. Consistent Hashing – Node Changes 2128 0 Key membership A and replicas are updated when a F node joins or leaves the network. Copy KeyAB B The number of replicas for all data is kept consistent. E Copy KeyFA C Copy KeyEF Dhttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 46
  48. 48. Virtual Nodes  Random assignment leads to:  Non-uniform data  Uneven load distribution  Solution: “virtual nodes”  A single node (physical machine) is assigned multiple random positions (“tokens”) on the ring.  On failure of a node, the load is evenly dispersed  On joining, a node accepts an equivalent load  Number of virtual nodes assigned to a physical node can be decided based on its capacity.http://www.slideshare.net/kingherc/bigtable-and-dynamo 47
  49. 49. Virtual Nodes – Load Distribution 2128 0 Different Strategies A I Virtual Nodes H Random tokens per each B Ring physical node, partition by token value (Key Space) G C Node 1: tokens A, E, G D Node 2: tokens C, F, H Node 3: tokens B, D, I F Ehttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 48
  50. 50. Virtual Nodes – Load Distributionhttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 49
  51. 51. Replication & Consistency (Quorum)N = number of nodes with a replica of the dataW = number of replicas that must acknowledge the update(*)R = minimum number of replicas that must participate in asuccessful read operation (*) but the data will be written to N nodes no matter whatW+R>N Strong Consistency (usually N=3, R=W=2)W=N, R=1 Optimised for readsW=1, R=N Optimised for writes (durability not guranteed in presence of failures)W+R<N Weak Consistency Latency is determined by the slowest of the R replicas for read, W replicas for write. 50
  52. 52. Vector Clocks & Conflict Detection Causality-based partial order over events that happen in the system. Document version history: a counter for each node that updated the document. If all update counters in V1 are smaller or equal to all update counters in V2, then V1 precedes V2http://en.wikipedia.org/wiki/Vector_clock 51
  53. 53. Vector Clocks & Conflict Detection Vector Clocks can detect a conflict. The conflict resolution is left to the application or the user. The application might resolve conflicts by checking relative timestamps, or with other strategies (like merging the changes). Vector clocks can grow quite large (!)http://en.wikipedia.org/wiki/Vector_clock 52
  54. 54. Gossip Protocol + Hinted Handoff A F periodic, pairwise, inter-process interactions of B Ring bounded size (Key Space) among randomly- E chosen peers C Dhttp://en.wikipedia.org/wiki/Vector_clock 53
  55. 55. Gossip Protocol + Hinted Handoff A F periodic, pairwise, I cant see B, it might be inter-process down but I need some ACK. My Merkle Tree interactions of B root for range XY is bounded size "ab03Idab4a385afda" among randomly- E chosen peers I cant see B either. My Merkle Tree root for range XY is different! B must be down C then. Let’s disable it. Dhttp://en.wikipedia.org/wiki/Vector_clock 54
  56. 56. Gossip Protocol + Hinted Handoff My canonical node is A supposed to be B. F periodic, pairwise, inter-process interactions of B bounded size among randomly- E chosen peers C I see. Well, Ill take care of it for now, and let B know when B is available again Dhttp://en.wikipedia.org/wiki/Vector_clock 55
  57. 57. Merkle Trees (Hash Trees) Leaves: hashes of data blocks. Nodes: hashes of their children. Used to detect inconsistencies between replicas (anti-entropy) and to minimise the amount of transferred datahttp://en.wikipedia.org/wiki/Hash_tree 56
  58. 58. Merkle Trees (Hash Trees) Node A Node B gossip exchange Minimal data transfer Differences are easy to locate SHA-1, Whirlpool or Tiger (TTH) hash functionshttp://www.slideshare.net/quipo/modern-algorithms-and-data-structures-1-bloom-filters-merkle-trees 57
  59. 59. Read Repair A F B GET(k, R=2) E C Dhttp://en.wikipedia.org/wiki/Vector_clock 58
  60. 60. Read Repair K=XYZ(v.2) A F K=XYZ(v.2) B GET(k, R=2) E C K=ABC(v.1) Dhttp://en.wikipedia.org/wiki/Vector_clock 59
  61. 61. Read Repair A F B GET(k, R=2) E UPDATE(k, XYZ) C Dhttp://en.wikipedia.org/wiki/Vector_clock 60
  62. 62. Google Bigtable Motivation Key Features System Architecture Building Blocks Data Model SSTable/Tablet/Table IO / Compaction 61
  63. 63. Motivation □Lots of (semi-)structured data at Google  URLs:  Contents crawl metadata links anchors pagerank , …  Per-user data:  User preference settings, recent queries/search results, …  Geographic locations:  Physical entities (shops, restaurants, etc.), roads, satellite image data, user annotations, … □Scale is large  Billions of URLs, many versions/page (~20K/version)  Hundreds of millions of users, thousands of q/sec  100TB+ of satellite image datahttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf 62
  64. 64. Key Features □Google, ~2006 □Distributed multi-level map □Fault-tolerant, persistent □Scalable  Thousands of servers  Terabytes of in-memory data  Petabyte of disk-based data  Millions of reads/writes per second, efficient scans □Self-managing  Servers can be added/removed dynamically  Servers adjust to load imbalancehttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf 63
  65. 65. System Architecturehttp://www.slideshare.net/kingherc/bigtable-and-dynamo 64
  66. 66. Building Blocks □Building blocks:  Google File System (GFS): Raw storage  Scheduler: Google Work Queue, schedules jobs onto machines  Lock service: Chubby, distributed lock manager  MapReduce: simplified large-scale data processing □BigTable uses of building blocks:  GFS: stores persistent data (SSTable file format for storage of data)  Scheduler: schedules jobs involved in BigTable serving  Lock service: master election, location bootstrapping  Map Reduce: often used to read/write BigTable datahttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf 65
  67. 67. Google File System □Large-scale distributed “filesystem” □Master: responsible for metadata □Chunk servers: responsible for reading and writing large chunks of data □Chunks replicated on 3 machines, master responsible for ensuring replicas exist □OSDI ’04 Paperhttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf 66
  68. 68. Chubby □Distributed Lock Service □File System {directory/file}, for locking  Coarse-grained locks, can store small amount of data in a lock □High Availability  5 replicas, one elected as master  Service live when majority is live  Uses Paxos algorithm to solve consensus □A client leases a session with the service □Also an OSDI ’06 Paperhttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf 67
  69. 69. Data model □“Sparse, distributed, persistent, multidim. sorted map” □<Row, Column, Timestamp> triple for key - lookup, insert, and delete API □Arbitrary “columns” on a row-by-row basis  Column family:qualifier. Family is heavyweight, qualifier lightweight  Column-oriented physical store- rows are sparse! □Does not support a relational model  No table-wide integrity constraints  No multirow transactionshttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.berkeley.edu/~kubitron/cs262/lectures/lec23-Pond-BigTable.pdf 68
  70. 70. SSTable □Immutable, sorted file of key-value pairs □Chunks of data plus an index  Index is of block ranges, not values  triplicated across three machines in GFS SSTable 64K 64K 64K block block block Indexhttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt 69
  71. 71. Tablet □Contains some range of rows of the table  Dynamically partitioned range of rows □Built out of multiple SSTables □Typical size: 100~200MB □Tablets are stored in Tablet Servers (~100 per server) □Unit of distribution and load balancing Tablet Start:aardvark End:apple SSTable SSTable 64K 64K 64K 64K 64K 64K block block block block block block Index Indexhttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt 70
  72. 72. Table □Multiple tablets(table segments) make up the table □SSTables SSTables can be shared can be shared □Tablets do not overlap, SSTables can overlap Tablet Tablet aardvark apple apple_two_E boat SSTable SSTable SSTable SSTablehttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt 71
  73. 73. Tablets & Splitting Large tables broken into tablets at row boundaries “language” “contents” aaa.com cnn.com EN “<html>…” cnn.com/sports.html TABLETS … Website.com … Zuppa.com/menu.htmlhttp://labs.google.com/papers/bigtable-osdi06.pdf, http://www.cs.wisc.edu/areas/os/Seminar/schedules/archive/bigtable.ppt 72
  74. 74. Finding a Tablet Approach: 3-level hierarchical lookup scheme for tablets – Location is ip:port of relevant server, all stored in META tablets – 1st level: bootstrapped from lock server, points to owner of META0 – 2nd level: Uses META0 data to find owner of appropriate META1 tablet – 3rd level: META1 table holds locations of tablets of all other tables META1 table itself can be split into multiple tabletshttp://labs.google.com/papers/bigtable-osdi06.pdf 73
  75. 75. Servers Tablet servers manage tablets, multiple tablets per server. Each tablet is 100-200 megs –Each tablet lives at only one server –Tablet server splits tablets that get too big Master responsible for load balancing and fault tolerance –Use Chubby to monitor health of tablet servers, restart failed servers –GFS replicates data. Prefer to start tablet server on same machine that the data is already at 74
  76. 76. BigTable I/O memtable read minor memory compaction GFS tablet log SSTable SSTable SSTable BMDiff Zippy write Merging / Major Compaction (GC) □ Commit log stores the writes  Recent writes are stores in the memtable  Older writes are stores in SSTables □ A read operation sees a merged view of the memtable and the SSTables □ Checks authorization from ACL stored in Chubbyhttp://www.slideshare.net/kingherc/bigtable-and-dynamo 75
  77. 77. Compactions Minor compaction – convert the memtable into an SSTable Reduce memory usage Reduce log traffic on restart Merging compaction Reduce number of SSTables Good place to apply policy “keep only N versions” Major compaction Merging compaction that results in only one SSTable No deletion records, only live data 76
  78. 78. Locality Groups Group column families together into an SSTable –Avoid mingling data, ie page contents and page metadata –Can keep some groups all in memory Can compress locality groups Bloom Filters on locality groups – avoid searching SSTable 77
  79. 79. NoSQL 종류 Key-value Stores Column Database Document Database Graph Database 78
  80. 80. NoSQL 분류표http://nosql.mypopescu.com/post/2335288905/nosql-databases-genealogy 79
  81. 81. NoSQL Landscape451 group 80
  82. 82. NoSQL 종류 I □Key-Value Stores □Based on DHTs / Amazon’s Dynamo paper □Data model: (global) collection of K-V pairs □Example: Voldemort, Tokyo, Riak, Redis □Column Store □Based on Google’s BigTable paper □Data model: big table, column families □Example: Hbase, Cassandra, Hypertablehttp://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases 81
  83. 83. NoSQL 종류 II □Document Store □Inspired by Lotus Notes □Data model: collections of K-V collections □Example: CouchDB, MongoDB □Graph Database □Inspired by Euler & graph theory □Data model: nodes, rels, K-V on both □Example: Neo4J, FlockDBhttp://www.slideshare.net/emileifrem/nosql-east-a-nosql-overview-and-the-benefits-of-graph-databases 82
  84. 84. NoSQL Data Model 비교http://highlyscalable.wordpress.com/2012/03/01/nosql-data-modeling-techniques/ 83
  85. 85. Key-value Stores  Voldemort  Riak  Redis 84
  86. 86. Voldemort AP □ LinkedIn, 2009, Apache 2.0, Java □ Model: Key-Value(Data Model), Dynamo(Distributed Model) □ Main point: Data is automatically replicated and partitioned to multiple servers □ Concurrency Control: MVCC □ Transaction: No □ Data Storage: BDB, MySQL, RAM □ Key Features □ Data is automatically replicated and partitioned to multiple servers □ Simple Optimistic Locking for multi-row updates □ Pluggable Storage Engine □ Multiple read-writes □ Consistent-hashing for data distribution □ Data Versioning □ Major Users: LinkedIn, GILT □ Best Use: Real-time, large-scalehttp://www.slideshare.net/adorepump/voldemort-nosql, http://nosql.findthebest.com/l/5/Voldemort 85
  87. 87. Voldemort Pros / Cons AP Pros Cons  Highly customizable - each layer  Versioning means lots of disk of the stack can be replaced as space being used. needed  Does not support range queries  Data elements are versioned  No complex query filters during changes  All joins must be done in code  All nodes are independent - no  No foreign key constraints SPOF  No triggers  Very, very fast reads  Support can be hard to findhttp://www.slideshare.net/cscyphers/big-data-platforms-an-overview 86
  88. 88. Voldemort : Logical Architecture AP □ Dynamo DHT Implementation LICENSE □ Consistent Hashing, Vector Clocks APACHE 2.0 LANGUAGE HTTP / Sockets Java Conflict resolved API / PROTOCOL at read and write Time HTTP Java Thrift Json, Java String, byte[], Avro Thrift, Avro, Protobuf ProtoBuf CONCURRENCY MVCC Simple Optimistic Locking for multi-row updates, pluggable storage enginehttp://www.project-voldemort.com/voldemort/design.html , http://www.slideshare.net/quipo/nosql-databases-why-what-and-when 87
  89. 89. Voldemort: Physical Architecture Options APhttp://www.project-voldemort.com/voldemort/design.html 88
  90. 90. Riak AP □ Basho Technologies, 2010, Apache 2.0, Erlang, C □ Model: Key-Value(Data Model), Dynamo(Distributed Model) □ Main Point: Fault tolerance □ Protocol: HTTP/REST or custom binary □ Transaction: No □ Data Storage: Plug-in □ Features □ Tunable trade-offs for distribution and replication (N, R, W) □ Pre- and post-commit hooks in JavaScript or Erlang □ Map/Reduce in Javascript and Erlang □ Links & link walking: use it as a graph database □ Secondary indices: but only one at once □ Large object support (Luwak) □ Major Users: Mozilla, GitHub, Comcast, AOL, Ask.com □ Best Uses: high availability □ Example Usage: CMS, text search, Point-of-sales data collection.http://nosql.findthebest.com/l/6/Riak, http://www.slideshare.net/seancribbs/introduction-to-riak-red-dirt-ruby-conf-training 89
  91. 91. Riak Pros / Cons AP Pros Cons  All nodes are equal - no SPOF  Not meant for small, discrete and  Horizontal Scalability numerous datapoints.  Full Text Search  Getting data in is great; getting it  RESTful interface(and HTTP) out, not so much  Consistency level tunable on each  Security is non-existent: "Riak assumes the internal environment is operation trusted"  Secondary indexes available  Conflict resolution can bubble up  Map/Reduce(JavaScript & Erlang to the client if not careful. only)  Erlang is fast, but its got a serious learning curve.http://www.slideshare.net/cscyphers/big-data-platforms-an-overview 90
  92. 92. Riak AP LICENSE APACHE 2.0 LANGUAGE C, Erlang API / PROTOCOL REST HTTP * ProtoBuf Buckets -> K-V “Links” (~relations) Targeted JS Map/Reduce Tunable consistency (one-quorum-all)http://www.slideshare.net/quipo/nosql-databases-why-what-and-when 91
  93. 93. Riak Logical Architecture APhttp://bcho.tistory.com/621, http://basho.com/technology/technology-stack/ 92
  94. 94. Redis CP □ VMWare, 2009, BSD, C/C++ □ Model: Key-Value(Data Model), Master-Slave(Distributed Model) □ Main Point: Blazing fast □ Protocol: Telnet-like □ Concurrency Control: Locks □ Transaction: Yes □ Data Storage: RAM (in-memory) □ Features □ Disk-backed in-memory database □ Currently without disk-swap (VM and Diskstore were abandoned) □ Master-slave replication □ Pub/Sub lets one implement messaging □ Major Users: StackOverflow, flickr, GitHub, Blizzard, Digg □ Best Uses: rapidly changing data, frequently written, rarely read statistical data □ Example Usage: Stock prices. Analytics. Real-time datahttp://nosql.findthebest.com/l/6/Riak, http://www.slideshare.net/seancribbs/introduction-to-riak-red-dirt-ruby-conf-training 93
  95. 95. Redis Pros / Cons CP Pros Cons  Transactional support  Entirely in memory  Blob storage  Master-slave replication (instead  Support for sets, lists and sorted of master-master) sets  Security is non-existent:  Support for Publish- designed to be used in trusted environments Subscribe(Pub-Sub) messaging  Does not support encryption  Robust set of operators  Support can be hard to findhttp://www.slideshare.net/cscyphers/big-data-platforms-an-overview 94
  96. 96. Redis CP K-V store “Data Structures Server” LICENSE Map, Set, Sorted Set, Linked List BSD Set/Queue operations, Counters, Pub-Sub, Volatile keys LANGUAGE ANSI C, C++ API / PROTOCOL *(Many Language) Telnet Like PERSISTENCE 10-100K ops (whole dataset in RAM + VM) In Memory bg snapshots Persistence via snapshotting (tunable fsync freq.) REPLICATIONS Master / Slave Distributed if client supports consistent hashinghttp://redis.io/presentation/Redis_Cluster.pdf, http://www.slideshare.net/quipo/nosql-databases-why-what-and-when 95
  97. 97. Column Families Stores  Cassandra  HBase 96
  98. 98. Cassandra AP □ ASF, 2008, Apache 2.0, Java □ Model: Column(Data Model), Dynamo(Distributed Model) □ Main Point: Best of BigTable and Dynamo □ Protocol: Thrift, Avro □ Concurrency Control: MVCC □ Transaction: No □ Data Storage: Disk □ Features □ Tunable trade-offs for distribution and replication (N, R, W) □ Querying by column, range of keys □ BigTable-like features: columns, column families □ Has secondary indices □ Writes are much faster than reads (!) □ Map/reduce possible with Apache Hadoop □ All nodes are similar, as opposed to Hadoop/HBase □ Major Users: Facebook, Netflix, Twitter, Adobe, Digg □ Best Uses: write often, read less □ Example Usage: banking, finance, logginghttp://nosql.findthebest.com/l/2/Cassandra, 97
  99. 99. Cassandra Pros / Cons AP Pros Cons  Designed to span multiple  No joins datacenters  No referential integrity  Peer to peer communication  Written in Java - quite complex to between nodes administer and configure  No SPOF  Last update wins  Always writeable  Consistency level is tunable at run time  Supports secondary indexes  Supports Map/Reduce  Support range querieshttp://www.slideshare.net/cscyphers/big-data-platforms-an-overview 98
  100. 100. Cassandra AP Data model of BigTable, infrastructure of Dynamo LICENSE APACHE 2.0 LANGUAGE Java PROTOCOL col_name Thrift Avro col_value timestamp PERSISTENCE Column memtable, SSTable CONSISTENCY Tunnable R/W/N xhttp://nosql.findthebest.com/l/2/Cassandra, 99
  101. 101. Cassandra AP Data model of BigTable, infrastructure of Dynamo LICENSE APACHE 2.0 LANGUAGE Java super_column_name PROTOCOL col_name col_name Thrift … Avro col_value col_value timestamp timestamp PERSISTENCE memtable, SSTable CONSISTENCY Tunnable R/W/N xhttp://nosql.findthebest.com/l/2/Cassandra, 100
  102. 102. Cassandra AP Data model of BigTable, infrastructure of Dynamo LICENSE APACHE 2.0 Column Family LANGUAGE Java super_column_name PROTOCOL col_name col_name Thrift row_key … Avro col_value col_value timestamp timestamp PERSISTENCE memtable, SSTable CONSISTENCY Tunnable R/W/N xhttp://nosql.findthebest.com/l/2/Cassandra, 101
  103. 103. Cassandra AP Data model of BigTable, infrastructure of Dynamo LICENSE APACHE 2.0 Super Column Family LANGUAGE Java super_column_name super_column_name PROTOCOL col_name col_name col_name col_name Thrift row_key … Avro … … col_value col_value col_value col_value timestamp timestamp timestamp timestamp PERSISTENCE memtable, SSTable keyspace.get (“column_family”, key, [“super_column”], “column”) CONSISTENCY Tunnable R/W/N xhttp://nosql.findthebest.com/l/2/Cassandra, 102
  104. 104. Cassandra AP Data model of BigTable, infrastructure of Dynamo LICENSE APACHE 2.0 Super Column Family LANGUAGE Java super_column_name super_column_name PROTOCOL col_name col_name col_name col_name Thrift row_key … Avro … … col_value col_value col_value col_value timestamp timestamp timestamp timestamp PERSISTENCE memtable, SSTable keyspace.get (“column_family”, key, [“super_column”], “column”) CONSISTENCY A B Random Partitioner (MD5) Tunnable ALL OrderPreservingPartitioner R/W/N P2P C ONE F Gossip QUORUM x E D Range Scans, Fulltext Index(Solandra)http://nosql.findthebest.com/l/2/Cassandra, 103
  105. 105. Cassandra - Data Model AP HashTable Object LICENSE APACHE 2.0 HashKey LANGUAGE Java PROTOCOL Thrift Avro PERSISTENCE memtable, SSTable CONSISTENCY Tunnable R/W/Nhttp://javamaster.wordpress.com/2010/03/22/apache-cassandra-quick-tour/ 104
  106. 106. Cassandra - Data Model AP • Column {name:"emailAddress", value:"cassandra@apache.org"} • Name-Value 구조체 {name:"age", value:"20"} UserProfile={ • Column Family Cassandra={emailAddress:”casandra@apache.org”, age:”20”} TerryCho={emailAddress:”terry.cho@apache.org”, • Column들의 집합 gender:”male”} • Hash-Key Column리스트 Cath= { emailAddress:”cath@apache.org” , age:”20”, gender:”female”, address:”Seoul”} } • Super-Column • Column안에 Column 포함 {name:”username” • ex) username->{firstname, value: firstname{name:”firstname”,value=”Terry”} lastname} value: lastname{name:”lastname”,value=”Cho”} } • Super-Column Family UserList={ Cath:{ • Column Family안에 Column username:{firstname:”Cath”,lastname:”Yoon”} Family 포함 address:{city:”Seoul”,postcode:”1234”}} Terry:{ username:{firstname:”Terry”,lastname:”Cho”} account:{bank:”hana”,accounted:”1234”}} }http://javamaster.wordpress.com/2010/03/22/apache-cassandra-quick-tour/ 105
  107. 107. Cassandra – Use Case: eBay AP □ A glimpse on eBay’s Cassandra deployment □ Dozens of nodes across multiple clusters □ 200 TB+ storage provisioned □ 400M+ writes & 100M+ reads per day, and growing □ #1: Social Signals on eBay product & item pages □ #2: Hunch taste graph for eBay users & items □ #3: Time series use cases (many): □ Mobile notification logging and tracking □ Tracking for fraud detection □ SOA request/response payload logging □ ReadLaser server logs and analytics □ Cassandra meets requirements □ Need Scalable counters □ Need real(or near) time analytics on collected social data □ Need good write performance □ Reads are not latency sensitivehttp://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376 106
  108. 108. HBase CP □ ASF, 2010, Apache 2.0, Java □ Model: Column(Data Model), Bigtable(Distributed Model) □ Main Point: Billions of rows X millions of columns □ Protocol: HTTP/REST (also Thrift) □ Concurrency Control: Locks □ Transaction: Local □ Data Storage: HDFS □ Features □ Query predicate push down via server side scan and get filters □ Optimizations for real time queries □ A high performance Thrift gateway □ HTTP supports XML, Protobuf, and binary □ Rolling restart for configuration changes and minor upgrades □ Random access performance is like MySQL □ A cluster consists of several different types of nodes □ Major Users: Facebook □ Best Use: random read write to large database □ Example Usage: Live messaginghttp://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/10/HBase 107
  109. 109. HBase Pros / Cons CP Pros Cons  Map/Reduce support  Secondary indexes generally not  More of a CA approach and AP supported  Supports predicate push down for  Security is non-existent performance gains  Requires a Hadoop infrastructure  Automatic partitioning and to function rebalancing of regins  Data is stored in a sorted order(not indexed)  RESTful API  Strong and vibrant ecosystemhttp://www.slideshare.net/cscyphers/big-data-platforms-an-overview 108
  110. 110. HBase vs. BigTable Terminology CP  Cons HBase BigTable Table Table Region Tablet RegionServer Tablet Server MemStore Memtable Hfile SSTable WAL Commit Log Flush Minor compaction Minor Compaction Merging compaction Major Compaction Major compaction HDFS GFS MapReduce MapReduce ZooKeeper Chubbyhttp://www.slideshare.net/cscyphers/big-data-platforms-an-overview 109
  111. 111. HBase: Architecture CP • Zookeeper as Coordinator (instead of Chubby) LICENSE • Hmaster: support for multiple masters APACHE 2.0 • HDFS, S3, S3N, EBS (with Gzip/LZO CF compression) LANGUAGE • Data sorted by key but evenly distributed across the cluster Java PROTOCOL REST, HTTP Thrift, Avro PERSISTENCE memtable, SSTablehttp://nosql.findthebest.com/l/10/HBase, http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html 110
  112. 112. HBase: Architecture - WAL CP LICENSE APACHE 2.0 LANGUAGE Java PROTOCOL REST, HTTP Thrift, Avro PERSISTENCE memtable, SSTablehttp://nosql.findthebest.com/l/10/HBase, http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html 111
  113. 113. HBase: Architecture - WAL CP LICENSE APACHE 2.0 LANGUAGE Java PROTOCOL REST, HTTP Thrift, Avro PERSISTENCE memtable, SSTablehttp://nosql.findthebest.com/l/10/HBase, http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html 112
  114. 114. HBase - Usecase: Facebook Message Service CP □ New Message Service □ combines chat, SMS, email, and Messages into a real-time conversation □ Data pattern  A short set of temporal data that tends to be volatile  An ever-growing set of data that rarely gets accessed □ chat service supports over 300 million users who send over 120 billion messages per month □ Cassandras eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure. □ HBase meets our requirements □ Has a simpler consistency model than Cassandra. □ Very good scalability and performance for their data patterns. □ Most feature rich for their requirements: auto load balancing and failover, compression support, multiple shards per server, etc. □ HDFS, the filesystem used by HBase, supports replication, end-to-end checksums, and automatic rebalancing. □ Facebooks operational teams have a lot of experience using HDFS because Facebook is a big user of Hadoop and Hadoop uses HDFS as its distributed file system.http://nosql.findthebest.com/l/10/HBase, http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html 113
  115. 115. Hbase – Usecase: Adobe CP □ When we started pushing 40 million records, Hbase squeaked and cracked. After 20M inserts it failed so bad it wouldn’t respond or restart, it mangled the data completely and we had to start over.  HBase community turned out to be great, they jumped and helped us, and upgrading to a new HBase version fixed our problems □ On December 2008, Our HBase cluster would write data but couldn’t answer correctly to reads.  I was able to make another backup and restore it on a MySQL cluster □ We decided to switch focus in the beginning of 2009. We were going to provide a generic, real-time, structured data storage and processing system that could handle any data volume.http://hstack.org/why-were-using-hbase-part-1 114
  116. 116. Document Stores  MongoDB  CouchDB 115
  117. 117. MongoDB CP □ 10gen, 2009, AGPL, C++ □ Model: Document(Data Model), Bigtable(Distributed Model) □ Main Point: Full Index Support, Querying, Easy to Use □ Protocol: Custom, binary(BSON) □ Concurrency Control: Locks □ Transaction: No □ Data Storage: Disk □ Features □ Master/slave replication (auto failover with replica sets) □ Sharding built-in □ Uses memory mapped files for data storage □ GridFS to store big data + metadata (not actually an FS) □ Major Users: Craigslist, Foursquare, SAP, MTV, Disney, Shutterfly, Intuit □ Example Usage: CMS system, comment storage, voting □ Best Use: dynamic queries, frequently written, rarely read statistical datahttp://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/10/HBase 116
  118. 118. MongoDB Pros / Cons CP Pros Cons  Auto-sharding  Does not support JSON: BSON  Auto-failover instead  Update in place  Master-slave replication  Spatial index support  Has had some growing pains(e.g.  Ad hoc query support Foursquare outage)  Any field in Mongo can be  Not RESTful by default indexed  Failures require a manual  Very, very popular (lots of database repair operation(similar production deployments) to MySQL)  Very easy transition from SQL  Replication for availability, not performancehttp://www.slideshare.net/cscyphers/big-data-platforms-an-overview 117
  119. 119. MongoDB CP LICENSE AGPL v3 LANGUAGE C++ PROTOCOL REST/BSON PERSISTENCE B+ Trees, Snapshots CONCURRENCY In-place Updates REPLICATION master-slave replica setshttp://www.slideshare.net/quipo/nosql-databases-why-what-and-when 118
  120. 120. MongoDB - Architecture CP • mongod: the core database process LICENSE • mongos: the controller and query router for sharded clusters AGPL v3 LANGUAGE C++ PROTOCOL REST/BSON PERSISTENCE B+ Trees, Snapshots CONCURRENCY In-place Updates REPLICATION master-slave replica setshttp://sett.ociweb.com/sett/settAug2011.html, http://www.infoq.com/articles/mongodb-java-php-python 119
  121. 121. CouchDB AP □ ASF, 2005, Apache 2.0, Erlang □ Model: Document(Data Model), Notes(Distributed Model) □ Main Point: DB consistency, easy to use □ Protocol: HTTP, REST □ Concurrency Control: MVCC □ Transaction: No □ Data Storage: Disk □ Features □ ACID Semantics □ Map/Reduce Views and Indexes □ Distributed Architecture with Replication □ Built for Offline □ Major Users: LotsOfWords.com, CERN, BBC, □ Example Usage: CRM, CMS systems □ Best Use: accumulating, occasionally changing data with pre-defined querieshttp://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis, http://nosql.findthebest.com/l/3/CouchDB 120
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×