Webinar - Navigating the NoSQL Landscape (Which freaking database should I use?)


Andrew Oliver of OS Integrators will explore the technical reasons you might select one of these NoSQL databases, the different types of databases available, their correct use and 'quintessential' use cases for them. Dipti Borkar of Couchbase will dive deeper into document databases and the features of Couchbase Server 2.0.

  • Most of you are probably familiar with the table layout. A table is defined with a set of columns, and each record in the table conforms to the schema. If you wish to capture different data in the future, the table schema must be changed using the ALTER TABLE statement. Typically data is normalized to third normal form to reduce duplication. Large tables are split into smaller tables using foreign keys.
  • The data is modeled for the application code and not for the database.
  • JSON support: documents are stored natively as JSON, so when you build an app no conversion is required. New document viewing and editing capability. Indexing and querying: look inside your JSON, build views, and query for a key, for ranges, or to aggregate data. Incremental MapReduce: powers indexing; build complex views over your data. Great for real-time analytics. XDCR: replicate information from one cluster to another cluster.
  • Could be used to integrate with other offerings like graph databases, relational databases, etc
  • Webinar - Navigating the NoSQL Landscape (Which freaking database should I use?)

    1. Which Freaking Database Should I Use? Andrew C. Oliver, Open Software Integrators, www.osintegrators.com, @osintegrators
    2. Andrew C. Oliver • Programming since I was about 8 • Java since ~1997 • Founded POI project (currently hosted at Apache) with Marc Johnson ~2000 o Former member Jakarta PMC o Emeritus member of Apache Software Foundation • Joined JBoss ~2002 • Former Board Member/current helper/lifetime member: Open Source Initiative (http://opensource.org) • Column in InfoWorld: http://www.infoworld.com/author-bios/andrew-oliver o I make fanboys cry.
    3. Open Software Integrators • Founded Nov 2007 by Andrew C. Oliver (me) o in Durham, NC • Revenue and staff have at least doubled every year since 2009. • New office (2012) in Chicago, IL o we're hiring mid to senior level as well as UI developers (jQuery, JavaScript, HTML, CSS) o up to 25% travel, salary + bonus, 401k, health, etc. o preferred: Java, Tomcat, JBoss, Hibernate, Spring, RDBMS, jQuery o nice to have: Hadoop, Neo4j, Couchbase, Ruby, at least one cloud platform
    4. Overview • Why not just use the RDBMS for everything? • Operational vs Analytical • Key Value • Column Family • Document • Graph • Hadoop? • Convergence of "clustered filesystems" and "databases"
    5. Why not "just use" the RDBMS for everything?
    6. Before we begin... • Let's handle the elephant, or rather the teddy bears, in the room:
    7. The CAP theorem
    8. RDBMS CAP characteristics • Great at consistency • Okay at availability • Not so great at partition tolerance...
    9. Single process model • Lots of servers with many connections to few servers.
    10. Multiprocess Model (diagram: four nodes, each with a Data Manager and a Cluster Manager)
    11. Historical Scalability • 10 MB disks were "big" • Scalability meant more disks, controllers and possibly CPUs • CPUs went from 4.77 MHz to 3.4 GHz • Disks went from 64kps@70ms to 6gb/s • Network speeds went from under 4mb to gigabit to bonded gigabit and beyond. • Disk speeds for a long time didn't keep up with CPU...
    12. The Mathematical model • RDBMS is based on "Relational Algebra" which is just an extension of basic "set theory" • Not every problem is a set problem: "direct path" or "which thing contains this other thing which has this other thing" (foaf) • Sometimes relationships are as important as the data • Sometimes data is even simpler than the relational model but needs higher levels of availability, etc. • One size never really did fit all
    13. Data Complexity
    14. Datarrhea • Yes, I've already registered that ;-) • The cheapness of storing data has yielded more demand o economics predicted this • Moore's law ended while you slept o Intel says next year (but when did CPU speeds last double?) • Massive parallelization is the most feasible way to get at it (counter trended with an explosion in disk speeds)
    15. ...but • If o your data is tabular; o fits cleanly in a relational model; o you aren't having scalability issues; o you don't have a large dataset; or o a dataset/problem that lends itself to massive parallelization... • you can probably stick with your RDBMS for now o ...and probably aren't at this conference anyhow.
    16. JPA/RDBMS Tables Example
        Person table
          PersonID | Firstname | Lastname | CompanyID
          2        | Andy      | Oliver   | 3
        Company table
          CompanyID | Name                      | City   | State
          3         | Open Software Integrators | Durham | NC
        Phone table
          PhoneNumber  | Type   | PersonID
          919.627.1236 | google | 2
          919.321.0119 | work   | 2
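The three tables on this slide can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (person, company, phone) and the snake_case column names are assumptions, since the slide shows only column headers and sample rows.

```python
import sqlite3

# In-memory database; names are assumed, values come from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (company_id INTEGER PRIMARY KEY, name TEXT, city TEXT, state TEXT);
CREATE TABLE person  (person_id INTEGER PRIMARY KEY, firstname TEXT, lastname TEXT,
                      company_id INTEGER REFERENCES company(company_id));
CREATE TABLE phone   (phone_number TEXT, type TEXT,
                      person_id INTEGER REFERENCES person(person_id));
INSERT INTO company VALUES (3, 'Open Software Integrators', 'Durham', 'NC');
INSERT INTO person  VALUES (2, 'Andy', 'Oliver', 3);
INSERT INTO phone   VALUES ('919.627.1236', 'google', 2);
INSERT INTO phone   VALUES ('919.321.0119', 'work', 2);
""")

# Reassembling one person requires joining across all three tables.
rows = conn.execute("""
SELECT p.firstname, p.lastname, c.name, ph.phone_number, ph.type
FROM person p
JOIN company c ON c.company_id = p.company_id
JOIN phone ph  ON ph.person_id = p.person_id
ORDER BY ph.type
""").fetchall()

for row in rows:
    print(row)
```

The join is the point: normalized data avoids duplication, but reading one "person" back means touching every table that holds a piece of it.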
    17. Operational vs Analytical • One DB type is unlikely to be well suited for all of your problems. • The system doing "short and sweet" "lightweight" transactions is your operational system. • The system doing long running reports and generating charts and graphs and statistics is your analytical system. • There is also search. There are recommendation engines, etc.
    18. Other types of databases
    19. Key-Value Stores • Examples: Couchbase 1.8, Cassandra o also: Gemfire, Infinispan (distributed caches) • Constant time O(1) lookup by key • Good enough for "right now" stock quotes • Usually combined with an index for search, but the structure isn't inherently indexed. • Generally works well with MapReduce. • Extremely scalable, easy to partition
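A minimal sketch of why key-value stores are easy to partition and give constant-time lookups. The class name `TinyKVCluster`, the CRC32-modulo placement and the sample keys are all assumptions for illustration; no real product's placement scheme is implied.

```python
import zlib


class TinyKVCluster:
    """Toy key-value store: each node is a dict, keys are hash-partitioned."""

    def __init__(self, node_count):
        self.nodes = [{} for _ in range(node_count)]

    def _node_for(self, key):
        # A stable hash picks the partition, so every client finds the same node.
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]

    def put(self, key, value):
        self._node_for(key)[key] = value

    def get(self, key, default=None):
        # Average-case O(1): one hash to pick the node, one dict lookup.
        return self._node_for(key).get(key, default)


cluster = TinyKVCluster(node_count=3)
cluster.put("quote:ACME", {"price": 12.5})   # hypothetical "right now" quotes
cluster.put("quote:INIT", {"price": 99.0})
latest = cluster.get("quote:ACME")
print(latest)
```

Note that nothing here can answer "which quotes are above 50?" without scanning every node, which is why the slide says these stores are usually paired with a separate index for search.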
    20. Column Family / Big Table • Many key-value stores support "column families" o Cassandra • Some were designed this way o HBase • Keys and values become composite • Essentially a hashmap with a multi-dimensional array o each column is a row of data • MapReduce friendly • Stock quote with time ranges
    21. HBase Example
        Row key                              | First name | Last name | Company                   | City   | State | Phone number | Phone type
        5bfbd4a0-d02a-11e1-9b23-0800200c9a66 | Andy       | Oliver    | Open Software Integrators | Durham | NC    | 919-627-1236 | google
        7b2435c0-d02a-11e1-9b23-0800200c9a66 | Andy       | Oliver    | Open Software Integrators | Durham | NC    | 919-321-0119 | work
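The "hashmap with a multi-dimensional array" idea can be sketched as nested dictionaries: row key, then column family, then column. This is plain Python, not the HBase client API; the family names ("person", "phone") and the helper names `put` and `scan` are assumptions, while the row keys and values are the two rows from the slide.

```python
from collections import defaultdict

# row key -> column family -> column qualifier -> value
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, family, qualifier, value):
    table[row_key][family][qualifier] = value


def scan(start_key, stop_key):
    """Yield rows in sorted key order where start_key <= key < stop_key."""
    for key in sorted(table):
        if start_key <= key < stop_key:
            yield key, table[key]


ROW_1 = "5bfbd4a0-d02a-11e1-9b23-0800200c9a66"
ROW_2 = "7b2435c0-d02a-11e1-9b23-0800200c9a66"

for row_key, number, phone_type in [(ROW_1, "919-627-1236", "google"),
                                    (ROW_2, "919-321-0119", "work")]:
    # Denormalized: the person data is repeated in every row.
    put(row_key, "person", "first_name", "Andy")
    put(row_key, "person", "last_name", "Oliver")
    put(row_key, "phone", "number", number)
    put(row_key, "phone", "type", phone_type)

matching_keys = [key for key, _ in scan("5", "6")]
print(matching_keys)
```

Because rows are ordered by key, a range scan over a composite key (for example a ticker symbol followed by a timestamp) is what makes the "stock quote with time ranges" case a natural fit.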
    22. Document databases • Many developers think these are the "holy grail" since they fit nicely with object-oriented programming. • Couchbase 2.0, CouchDB, MongoDB • JSON documents • One way to think of this is a key-value store that understands the values. • Not as MapReduce friendly; larger datasets require indexes. • A clear fit for REST services, operational store
    23. Document databases • JSON document: { "firstname" : "Andy", "lastname" : "Oliver", "company" : "Open Software Integrators", "location" : { "city" : "Durham", "state" : "NC" }, "phone" : [ { "number" : "123 456 7890", "type" : "mobile" }, { "number" : "123 654 1234", "type" : "work" } ] }
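To illustrate "a key-value store that understands the values", the sketch below parses the JSON document from this slide with Python's standard json module and reads nested fields directly. The variable names are illustrative only.

```python
import json

raw = """
{ "firstname" : "Andy", "lastname" : "Oliver",
  "company" : "Open Software Integrators",
  "location" : { "city" : "Durham", "state" : "NC" },
  "phone" : [ { "number" : "123 456 7890", "type" : "mobile" },
              { "number" : "123 654 1234", "type" : "work" } ] }
"""

doc = json.loads(raw)

# The whole person arrives as one object: no joins, nested data stays nested.
city = doc["location"]["city"]
work_numbers = [p["number"] for p in doc["phone"] if p["type"] == "work"]

print(city, work_numbers)
```

The parsed structure maps directly onto objects in application code, which is the "fits nicely with object-oriented programming" argument from the previous slide.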
    24. Graph Databases • Based on Graph Theory • Less about volume of the data and more about complexity • Many are transactional o often the transactions are "more correct" than those offered by a relational database. • FOAF, direct path operations are easy o very complicated/inefficient in RDBMS • Usually paired with an index for search
    25. Design: RDBMS vs Graph
    26. Neo4j Graph Example (diagram) • Nodes: person (Firstname: Andrew, Lastname: Oliver); Company: Open Software Integrators; City: Durham, State: NC; City: Chicago, State: IL; Phone Number: 919.627.1236, Type: googlevoice; Phone Number: 919.321.0119, Type: work • Relationship labels: WORKS FOR, FOUNDED, RESIDES, LOCATED, HAS • Note the extra relationships and details here - graph databases are just fun and easy to understand.
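The FOAF and direct-path operations mentioned for graph databases can be sketched over a plain adjacency map. This is not Neo4j's API; the `knows` graph and every name in it are made up for illustration.

```python
from collections import deque

# Hypothetical "knows" relationships (undirected).
knows = {
    "andy": {"bob", "carol"},
    "bob": {"andy", "dave"},
    "carol": {"andy", "dave"},
    "dave": {"bob", "carol", "erin"},
    "erin": {"dave"},
}


def friends_of_friends(graph, person):
    """People exactly two hops away: friends of friends who are not already friends."""
    direct = graph[person]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {person}


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(graph[path[-1]]):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


fof = friends_of_friends(knows, "andy")
path = shortest_path(knows, "andy", "erin")
print(fof, path)
```

In a relational schema each extra hop is another self-join on a relationship table, which is why the slides call these operations complicated and inefficient in an RDBMS.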
    27. Where does Hadoop fit? • NoSQL • Software Framework (lots of pieces/lots of choices): o Pig - scripting language used to quickly write MapReduce code to handle unstructured sources o Hive - facilitates structure for the data o HCatalog - provides inter-operability between these internal systems o HBase - Bigtable-type database o HDFS - Hadoop file system • Excellent choice for data processing and data analysis • MapReduce
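For readers new to MapReduce, here is a single-process sketch of the map, shuffle and reduce phases in plain Python. It does not use Hadoop; the function names and the sample records are assumptions made up for illustration.

```python
from collections import defaultdict

# Made-up input records.
records = [
    {"name": "Andy", "state": "NC"},
    {"name": "Dipti", "state": "CA"},
    {"name": "Joe", "state": "CA"},
]


def map_phase(record):
    # Emit one (key, value) pair per record.
    yield record["state"], 1


def shuffle(pairs):
    # Group all values by key; a real framework does this across machines.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    return key, sum(values)


pairs = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

Each map call sees one record and each reduce call sees one key, so both phases can be spread over many machines, which is the parallelization argument made earlier in the talk.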
    28. Convergence of Filesystems and Databases • Hadoop HDFS is...a distributed filesystem • So are Gluster, Ceph, GFS, etc. • Hadoop can use Ceph or Gluster in place of HDFS
    29. Other Derivatives • Triplestores o Apache Jena • OODBMS / ORDBMS o Caché
    30. Things you may consider • Persistence o Asynch / Synch • Replication • Availability • Transactions / Consistency • "Locality" • Language • Resources o http://en.wikipedia.org/wiki/Comparison_of_structured_storage_software o http://sevenweeks.org/
    31. Conclusions • RDBMS may not scale to your needs • Your data may not map efficiently to tables • Key Value Store - data by key, fast, scalable, can't handle complex data • Column Family/Big Table - fast, scalable, denormalized, MapReduce, good for series, not efficient for complex data • Document - a good operational system, not your analytical, moderately scalable, matches OO • Graph - great for complex data, transactional, less scalable • Filesystems and "databases" are converging
    32. Thank you for attending! Andrew C. Oliver, Open Software Integrators, www.osintegrators.com, @osintegrators
    33. Introduction to Document Databases and Couchbase. Dipti Borkar, Director, Product Management
    34. Couchbase Server 2.0: NoSQL Document Database
    35. Couchbase Server - Core Capabilities • Easy Scalability: Grow cluster without application changes, without downtime, with a single click • Consistent High Performance: Consistent sub-millisecond read and write response times with consistent high throughput • Always On 24x365: No downtime for software upgrades, hardware maintenance, etc. • Flexible Data Model: JSON document model with no fixed schema.
    36. Relational vs Document data model • Relational data model: Highly-structured table organization with rigidly-defined data formats and record structure. • Document data model: Collection of complex documents with arbitrary, nested data formats and varying "record" format.
    37. Making a Change Using RDBMS (diagram: five related tables with sample rows) • User Table: User ID, First, Last, Zip, Country ID • Photo Table: User ID, Photo ID, Comment, Country ID • Status Table: User ID, Status ID, Text, Country ID • Affiliations Table: User ID, Affl ID, Affl Name, Country ID • Country Table: Country ID, Country name
    38. Making the Same Change with a Document Database • JSON document: { "ID": 1, "FIRST": "Dipti", "LAST": "Borkar", "ZIP": "94040", "CITY": "MV", "STATE": "CA", "STATUS": { "TEXT": "At Conf", "GEO_LOC": "134" }, "COUNTRY": "USA" } • Just add information to a document
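A small sketch of "just add information to a document", using a Python dict as a stand-in for a document store. The field names follow the slide; the nesting of STATUS is an assumed reading of the slide's JSON, and the store keys ("user::1", "user::2") and the second user's placement are hypothetical.

```python
# Stand-in for a document store: key -> JSON-like document.
store = {
    "user::1": {"ID": 1, "FIRST": "Dipti", "LAST": "Borkar",
                "ZIP": "94040", "CITY": "MV", "STATE": "CA"},
    "user::2": {"ID": 2, "FIRST": "Joe", "LAST": "Smith", "ZIP": "94040"},
}

# The change touches only the document that needs it:
# no ALTER TABLE, no new join table, no migration of other records.
store["user::1"]["STATUS"] = {"TEXT": "At Conf", "GEO_LOC": "134"}
store["user::1"]["COUNTRY"] = "USA"

# Readers must tolerate documents that lack the new fields.
countries = {key: doc.get("COUNTRY", "unknown") for key, doc in store.items()}
print(countries)
```

The trade-off is that the schema now lives in application code: every reader has to handle documents written before the field existed.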
    39. Couchbase Server 2.0 Architecture (diagram) • Client-facing ports: 8092 Query API, 11211 Memcapable 1.0 (Moxi), 11210 Memcapable 2.0 • Data Manager: Query Engine, Memcached, Couchbase EP Engine, storage interface, New Persistence Layer • Cluster Manager (Erlang/OTP): REST management API/Web UI (http), vBucket state and replication manager, Global singleton supervisor, Rebalance orchestrator, Configuration manager, Node health monitor, Process monitor, Heartbeat (components are labeled "on each node" or "one per cluster") • Cluster ports: HTTP 8091, Erlang port mapper 4369, Distributed Erlang 21100 - 21199
    40. Couchbase: "The basics"
    41. Basic Operation (diagram: two app servers, each with a Couchbase client library and a cluster map, reading, writing and updating docs on a three-server Couchbase Server cluster; each server holds active docs and replica docs; user configured replica count = 1) • Docs distributed evenly across servers • Each server stores both active and replica docs – Only one server active at a time • Client library provides app with simple interface to database • Cluster map provides map to which server doc is on – App never needs to know • App reads, writes, updates docs • Multiple app servers can access same document at same time
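The cluster map idea can be sketched as a two-step lookup: hash the document key to a partition (a vBucket, the term used on the architecture slide), then look up which server holds the active copy. Everything below is a simplified assumption for illustration: the partition count, the CRC32-modulo hash, the server names and the round-robin layout are not Couchbase's actual values or algorithm.

```python
import zlib

NUM_VBUCKETS = 8                                  # tiny, for illustration only
SERVERS = ["server1", "server2", "server3"]       # hypothetical names

# Cluster map: vBucket -> which server holds the active copy and the replica.
# With replica count = 1, the replica always lives on a different server.
cluster_map = {
    vb: {"active": SERVERS[vb % len(SERVERS)],
         "replica": SERVERS[(vb + 1) % len(SERVERS)]}
    for vb in range(NUM_VBUCKETS)
}


def vbucket_for(key):
    # Stable hash, so every client library maps a key to the same vBucket.
    return zlib.crc32(key.encode("utf-8")) % NUM_VBUCKETS


def server_for(key):
    # The client library does this lookup; the app only calls get/set.
    return cluster_map[vbucket_for(key)]["active"]


target = server_for("doc5")
print(target)
```

Because clients route through the map rather than a fixed key-to-server formula, the cluster can move a vBucket and publish a new map without any change to application code.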
    42. Add Nodes to Cluster (diagram: the same two app servers, now with a five-server cluster; user configured replica count = 1) • Two servers added with one-click operation • Docs automatically rebalance across cluster – Even distribution of docs – Minimum doc movement • Cluster map updated • App database calls now distributed over larger number of servers
    43. Fail Over Node (diagram: five-server cluster in which Server 3 has failed; user configured replica count = 1) • App servers accessing docs • Requests to Server 3 fail • Cluster detects server failed – Promotes replicas of docs to active – Updates cluster map • Requests for docs now go to appropriate server • Typically rebalance would follow
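The failover steps on this slide (detect the failure, promote replicas, update the cluster map) can be sketched as a pure function over a partition map. The map layout, the server names and the `fail_over` helper are assumptions for illustration, not Couchbase internals.

```python
def fail_over(cluster_map, failed_server):
    """Return a new map with replicas promoted for partitions the failed server owned."""
    new_map = {}
    for partition, owners in cluster_map.items():
        if owners["active"] == failed_server:
            # Promote the surviving replica to active.
            new_map[partition] = {"active": owners["replica"], "replica": None}
        elif owners["replica"] == failed_server:
            # Active copy is fine, but the replica is gone until a rebalance.
            new_map[partition] = {"active": owners["active"], "replica": None}
        else:
            new_map[partition] = dict(owners)
    return new_map


# Hypothetical three-partition map with replica count = 1.
cluster_map = {
    0: {"active": "server1", "replica": "server2"},
    1: {"active": "server2", "replica": "server3"},
    2: {"active": "server3", "replica": "server1"},
}

after = fail_over(cluster_map, "server3")
print(after)
```

The `None` replicas show why the slide says a rebalance typically follows: data stays available, but the affected partitions have no spare copy until replicas are rebuilt elsewhere.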
    44. New in 2.0 • JSON support • Indexing and Querying • Incremental Map Reduce • Cross data center replication
    45. Cluster wide - Indexing and Querying (diagram: two app servers, each with a Couchbase client library and a cluster map, sending a query to a three-server cluster; user configured replica count = 1) • Indexing work is distributed amongst nodes • Large data set possible • Parallelize the effort • Each node has index for data stored on it • Queries combine the results from required nodes
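The "each node indexes its own data, queries combine the results" pattern is a scatter-gather. Here is a minimal single-process sketch; the server names, document contents and helper names are hypothetical, and a real cluster would run the per-node step in parallel over the network.

```python
import heapq

# Hypothetical documents held by each node.
nodes = {
    "server1": [{"id": "doc5", "zip": "94040"}, {"id": "doc2", "zip": "10010"}],
    "server2": [{"id": "doc4", "zip": "30303"}, {"id": "doc7", "zip": "94040"}],
    "server3": [{"id": "doc1", "zip": "31311"}],
}


def build_local_index(docs, field):
    # Each node builds a sorted index over only the docs it stores.
    return sorted((doc[field], doc["id"]) for doc in docs)


def query_range(indexes, low, high):
    # Scatter: every node filters its own index.
    partials = [[entry for entry in index if low <= entry[0] <= high]
                for index in indexes.values()]
    # Gather: merge the already-sorted partial results into one sorted list.
    return list(heapq.merge(*partials))


indexes = {name: build_local_index(docs, "zip") for name, docs in nodes.items()}
results = query_range(indexes, "30000", "95000")
print(results)
```

Index building scales with the number of nodes, but a range query may have to contact every node, so query latency is bounded by the slowest one.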
    46. Cluster wide - XDCR (diagram: two Couchbase Server clusters, one in the NY data center and one in the SF data center, each with three servers holding active docs in RAM and on disk)
    47. Full Text Search Integration • Elastic Search is good for ad-hoc queries and faceted browsing • Couchbase adapter uses XDCR to push mutations to ES • Couchbase ES Adapter is cluster-aware (diagram: unidirectional cross data center replication to ElasticSearch)
    48. Couchbase Server Admin Console
    49. Use cases
    50. Data driven use cases • Support for unlimited data growth • Data with non-homogeneous structure • Need to change data structure quickly and often • 3rd party or user defined structure • Variable length documents • Sparse data records • Hierarchical data
    51. Performance driven use cases • Low latency matters • High throughput matters • Large number of users • Unknown demand with sudden growth of users/data • Predominantly direct document access • Workloads with very high mutation rate per document
    52. Use Case Examples
        Web app or Use-case | Couchbase Solution | Example Customer
        Content and Metadata Management System | Couchbase document store + Elastic Search | McGraw-Hill…
        Social Game or Mobile App | Couchbase stores game and player data | Zynga…
        Ad Targeting | Couchbase stores user information for fast access | AOL…
        User Profile Store | Couchbase Server as a key-value store | TuneWiki…
        Session Store | Couchbase Server as a key-value store | Concur…
        High Availability Caching Tier | Couchbase Server as a memcached tier replacement | Orbitz…
        Chat/Messaging | Couchbase Server | DOCOMO…
    53. Q&A
    54. Andrew Oliver: info@osintegrators.com, www.osintegrators.com • Dipti Borkar: dipti@couchbase.com, www.couchbase.com