Big Data, NoSQL with MongoDB and Cassandra


Presentation on Big Data, NoSQL with MongoDB and Cassandra


Transcript

  • 1. Big Data and NoSQL with MongoDB & Cassandra NOSQL Intro with MongoDB and Cassandra 1
  • 2. Brian Enochson: software engineer who has worked as a designer and developer on NoSQL systems (Mongo, Cassandra, Hadoop). Specializes in software development, architecture and training. Contact: brian.enochson@gmail.com, Twitter @benochso, Google Plus https://plus.google.com/+BrianEnochson
  • 3.
      • Presentation intro
      • Introduction to Big Data
      • Introduction to NoSQL
      • Relational database to NoSQL technology: contrast and compare
      • NoSQL landscape
  • 4.
      • Introduction to MongoDB
      • MongoDB components, capabilities and common use cases
      • JSON & BSON
      • Documents, collections, references and Mongo ID
      • Querying
      • Data modeling / schema design
      • Replication & sharding
  • 5.
      • Cassandra
      • Architecture
      • Data model
      • Data modeling
      • Application development
      • Wrap-up and final Q & A
  • 6. http://www.cloudtweaks.com/2014/01/hand-writing-data-data-everywhere-but-lets-juststop-and-think/
  • 7. Why are databases like Mongo or Cassandra needed? To understand, one needs to look at:
      • The history of databases
      • How systems were built in the past
      • Then examine modern applications
          • Web scale
          • Data acquisition
          • Other factors like the cost of hardware
  • 8.
      • 1960s: hierarchical and network types (IMS and CODASYL)
      • 1970s: beginnings of the theory behind the relational model (Codd)
      • 1980s: rise of the relational model; SQL; E/R model (Chen)
      • 1990s: Access/Excel and MySQL; ODMS began to appear
      • 2000s: two forces, large enterprise and open source; Google and Amazon; CAP theorem (more on that to come)
      • 2010s: emergence of NoSQL as an industry player and viable alternative
  • 9. Developers today are faced with:
      • Internet scale: hundreds of thousands of users
      • Low cost of storage
      • Increased processing power
      • The ability (and need) to capture millions of events
      • Caching, which solves the problem to an extent but brings other complexities
      • Real-time requirements
      • The need to scale out and not up (add any number of low-cost machines vs. replace with a more powerful machine)
      • Cost
          • For enterprise DBs, Internet scale can become expensive
          • Open source DBs may solve license cost, but don't ignore operational costs
  • 10. Some facts from http://www.storagenewsletter.com/rubriques/market-reportsresearch/ibm-cmo-study/
      • Approximately 90 percent of all the real-time information being created today is unstructured data
      • Every day we create 2.5 quintillion bytes of data (a quintillion is 10 to the 18th, that is 18 zeroes!!)
      • 90 percent of the world's data today has been created in the last two years alone
  • 11.
      • Relational
          • Divide into tables, relate through foreign keys, DB constraints, normalized data; the interface is SQL
      • NoSQL
          • Store in a schemaless format; redundancy is encouraged; application access (your queries) determines the storage format. The interface varies and is optimized for the implementation; no forced DB constraints.
  • 12. "Luckily, due to the large number of compromises made when attempting to scale their existing relational databases, these tradeoffs were not so foreign or distasteful as they might have been." Greg Burd, https://www.usenix.org/legacy/publications/login/2011-10/openpdfs/Burd.pdf
  • 13.
      • Eventual consistency
      • The application has increased responsibility, such as maintaining consistency and handling transactions
      • Store redundant data
  • 14. The driving force in requiring new technology is often referred to as the "3 V's":
      • Volume: amount of data
      • Variety: range of data types and sources
      • Velocity: speed of data in and out
  • 15. NoSQL != Big Data
      • NoSQL products were created to help solve the big data problem.
      • Big data is a much larger problem than just storage. Analysis tools like Hadoop, messaging systems like Kafka, real-time processing engines like Storm and machine learning (Mahout) all help solve the big data problem.
  • 16.
      • Document DB
          • MongoDB, CouchDB
      • Wide Column / Column Family
          • Cassandra, HBase, Amazon SimpleDB
      • Key Value
          • Riak, Redis, DynamoDB, Voldemort, MemcacheDB
      • Graph
          • Neo4J, OrientDB
      • Search (search can also be a persistence store)
          • Lucene, Solr, ElasticSearch
      • Many, many, many more! (http://nosql-database.org/)
  • 17. Choosing the right NoSQL type and eventual product depends on:
      • Type of data
          • One key and a lot of data?
          • Schema variance
          • High volume of data?
          • Storing media, blobs
          • Document oriented?
          • Tracking relationships?
          • Combination?
          • Multi-datacenter
      • Type of access
      • Volumes of data (there is big data and there is BIG DATA)
      • Need/want support, services, training
  • 18.
      • ACID
      • CAP Theorem
      • BASE
  • 19. PROBABLY HAVE HEARD OF ACID
      • Atomic: all or none
      • Consistency: what is written is valid
      • Isolation: one operation at a time
      • Durability: once committed to the DB, it stays
    This is the world we have lived in for a long time…
  • 20. Many may have heard this one. CAP stands for Consistency, Availability and Partition Tolerance:
      • Consistency: all nodes see the same data at the same time. This is related to, but not the same as, the C in ACID.
      • Availability: the service is available.
      • Partition Tolerance: no failure other than complete network failure causes the system not to respond.
      ** http://www.cs.berkeley.edu/~brewer/cs262b2004/PODC-keynote.pdf
  • 21. In Mongo terms you can have 2 of 3: availability, partition tolerance or eventual consistency.
  • 23.
      • So we are talking about large amounts of data
      • High velocity of acquisition
      • A lot of variety that we need to store. We will worry later about how to handle it (or not)
      • Need to scale and not break the bank
      • Want the database to support agile, not hinder it
  • 24.
      • Maybe consider going relational if:
          • Highly transactional (FoundationDB?)
          • Business intelligence systems (Hadoop may make this not true)
      • Don't be fooled by fear of losing ACID: http://highscalability.com/blog/2013/5/1/myth-eric-brewer-onwhy-banks-are-base-not-acid-availability.html
  • 25. And now let's look at MongoDB
  • 26. http://db-engines.com/en/ranking_definition
  • 27. A few high-level points:
      • Document oriented
      • Storage format is JSON (actually BSON)
      • Replication built in
      • Master / slave architecture
      • Strong querying support
      • Name from "humongous"
  • 28.
      • Open source
      • Schemaless
      • Scalable
      • Document-level atomicity
      • Easy installation
      • Relative ease of use
      • Great (!!!!) documentation
  • 29.
      • No cross-document transactions
      • No joins
      • Replication: master / slave
      • Sharding
  • 30. Credit: Dwight Merriman, Founder and CEO, MongoDB (was 10Gen)
  • 31. Master / slave and secondary reads ** http://docs.mongodb.org/manual/core/replication-introduction/
  • 32.
      • Primary
          • Receives all write requests
          • A replica set can have only one primary
          • Mongo stores all changes in the oplog
      • Secondary
          • Replicates the primary's oplog
          • Clients can prefer to read from secondaries
      • If the primary goes down, a new primary is elected (after 10 seconds with no response)
  • 33. http://docs.mongodb.org/manual/core/sharding-introduction/
  • 34.
      • Shards
          • Store the data; in production each shard is normally a replica set
      • Routers
          • Route client operations to shards based on the shard key; there can be more than one for availability
          • The shard key is range based or hashed
      • Config servers
          • Contain cluster metadata
          • In production there are 3 config servers
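The difference between a hashed and a range-based shard key can be sketched in a few lines of Python. This is only an illustration of the routing idea, not how MongoDB's routers are implemented: the shard names, the range bounds and the use of MD5 with a modulo are all assumptions made for the example.

```python
import bisect
import hashlib

# Hypothetical shard names; a real router learns these from the config servers.
SHARDS = ["shard0", "shard1", "shard2"]


def hashed_shard(key: str) -> str:
    """Hashed sharding: hash the shard key, then map the hash onto a shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


# Range sharding: each shard owns a contiguous range of key values.
# These are the (exclusive) upper bounds of the first two ranges;
# the last shard takes everything from "p" upward.
RANGE_BOUNDS = ["h", "p"]


def ranged_shard(key: str) -> str:
    """Range sharding: find the range that contains the shard key."""
    return SHARDS[bisect.bisect_right(RANGE_BOUNDS, key)]
```

With range sharding, neighbouring keys such as "alice" and "bob" land on the same shard, which helps range queries; with hashing they are spread out, which helps write distribution.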
  • 35. At its simplest, Mongo is a document-oriented database.
      • MongoDB stores all data in documents, which are JSON-style data structures composed of field-and-value pairs.
      • MongoDB stores documents on disk in the BSON serialization format. BSON is a binary representation of JSON documents. BSON contains more data types than JSON does.
      ** For in-depth BSON information, see bsonspec.org.
  • 36. { "_id" : "52a602280f2e642811ce8478", "ratingCode" : "PG13", "country" : "USA", "entityType" : "Rating" }
  • 38. Documents have the following rules:
      • The maximum BSON document size is 16 megabytes.
      • The field name _id is reserved for use as a primary key; its value must be unique in the collection.
      • Field names cannot start with the $ character.
      • Field names cannot contain the . character.
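As a rough illustration of these rules, the Python sketch below checks a document's top-level field names and approximate size. The function name `check_document` is made up for this example, and the size check uses the JSON text length as a stand-in for the real BSON size, so treat it as an approximation only.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # the 16 MB limit quoted on the slide


def check_document(doc: dict) -> list:
    """Return a list of rule violations for one document (empty if none)."""
    problems = []
    for name in doc:
        if name.startswith("$"):
            problems.append(f"field name {name!r} starts with $")
        if "." in name:
            problems.append(f"field name {name!r} contains a dot")
    # Assumption: the JSON length is only a rough proxy for the BSON size.
    if len(json.dumps(doc).encode("utf-8")) > MAX_DOCUMENT_BYTES:
        problems.append("document is larger than 16 MB")
    return problems


rating = {
    "_id": "52a602280f2e642811ce8478",
    "ratingCode": "PG13",
    "country": "USA",
    "entityType": "Rating",
}
violations = check_document(rating)  # the sample document breaks no rule
```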
  • 39.
      • Windows: http://docs.mongodb.org/manual/tutorial/install-mongodb-on-windows/
      • Mac: http://docs.mongodb.org/manual/tutorial/install-mongodb-on-os-x/
      • Create the data directory. Defaults:
          • C:\data\db
          • /data/db/ (make sure you have permissions)
      • Or set it using --dbpath: C:\mongodb\bin\mongod.exe --dbpath d:\test\mongodb\data
  • 40.
      • Database: mongod
      • Shell: mongo
          • show dbs
          • show collections
          • db.stats()
  • 41. 1_simpleinsert.txt
      • Insert
      • Find
      • Find all
      • Find one
      • Find with criteria
      • Indexes
      • explain()
  • 42. 2_arrays_sort.txt
      • Embedded documents
      • Limit, sort
      • Using regex in a query
      • Removing documents
      • Dropping a collection
  • 43. 3_imp_exp.txt
      • Mongo provides tools for getting data in and out of the database
          • Data can be exported to JSON files
          • JSON files can then be imported
  • 44. 4_cond_ops.txt
      • $lt
      • $gt
      • $gte
      • $lte
      • $or
      • Also $not, $exists, $type, $in (for $type, refer to http://docs.mongodb.org/manual/reference/operator/query/type/#_S_type)
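To make the meaning of these operators concrete, here is a small Python sketch that evaluates a Mongo-style query document against plain dictionaries. It is a toy evaluator written for this transcript (the `matches` function and the sample `posts` data are hypothetical), covering only $lt, $lte, $gt, $gte, $in and $or; it is not the server's query engine.

```python
# Comparison operators, keyed by their Mongo-style names.
OPERATORS = {
    "$lt": lambda value, arg: value < arg,
    "$lte": lambda value, arg: value <= arg,
    "$gt": lambda value, arg: value > arg,
    "$gte": lambda value, arg: value >= arg,
    "$in": lambda value, arg: value in arg,
}


def matches(doc: dict, query: dict) -> bool:
    """Return True if the document satisfies every clause of the query."""
    for field, condition in query.items():
        if field == "$or":
            # $or takes a list of sub-queries; at least one must match.
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # A dict of operators, e.g. {"$gte": 10, "$lt": 30}.
            if field not in doc:
                return False
            for op, arg in condition.items():
                if not OPERATORS[op](doc[field], arg):
                    return False
        elif doc.get(field) != condition:
            # A plain value means equality.
            return False
    return True


posts = [
    {"title": "a", "likes": 5},
    {"title": "b", "likes": 12},
    {"title": "c", "likes": 30},
]
mid_range = [p["title"] for p in posts if matches(p, {"likes": {"$gte": 10, "$lt": 30}})]
either = [p["title"] for p in posts
          if matches(p, {"$or": [{"likes": {"$lt": 10}}, {"title": "c"}]})]
```

The query documents have the same shape one would pass to `find()` in the shell, which is the point of the sketch: a query is itself just a document.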
  • 45.
      • Aggregation Framework
          • Uses a pipeline model to perform a series of operations on data.
          • A common pattern is a match phase (selection) followed by grouping (creating the result).
      • Map Reduce
          • Two phases
          • Map creates one or more documents from each input document
          • Reduce combines the output from Map into some result
          • Finalize: an optional step that can perform some logic (e.g. sorting) on the reduce output
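Both models can be imitated in plain Python to show how the phases fit together. The sketch below follows the slide's description (match then group; map, reduce, optional finalize) on an invented `orders` list; the function names are hypothetical and nothing here calls MongoDB.

```python
from collections import defaultdict

# Invented sample data for the illustration.
orders = [
    {"customer": "ann", "status": "shipped", "total": 20},
    {"customer": "bob", "status": "shipped", "total": 15},
    {"customer": "ann", "status": "open", "total": 5},
    {"customer": "ann", "status": "shipped", "total": 10},
]

# Aggregation-style pipeline: a match stage followed by a group stage.
matched = [o for o in orders if o["status"] == "shipped"]
pipeline_totals = defaultdict(int)
for order in matched:
    pipeline_totals[order["customer"]] += order["total"]


# Map-reduce style: the same result built from explicit phases.
def map_phase(order):
    """Emit one (key, value) pair per input document."""
    yield order["customer"], order["total"]


def reduce_phase(key, values):
    """Combine all values emitted for one key."""
    return sum(values)


def finalize(results):
    """Optional last step; here it sorts by total, largest first."""
    return sorted(results.items(), key=lambda kv: kv[1], reverse=True)


emitted = defaultdict(list)
for order in matched:
    for key, value in map_phase(order):
        emitted[key].append(value)
reduced = {key: reduce_phase(key, values) for key, values in emitted.items()}
map_reduce_totals = finalize(reduced)
```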
  • 46. 5_admin.txt
      • show dbs
      • show collections
      • db.stats()
      • db.posts.stats()
      • db.posts.drop()
      • db.system.indexes.find()
  • 47.
      • Remember that with NoSQL, redundancy is not evil
      • Applications ensure consistency, not the DB
      • Applications join data; joins are not defined in the DB
      • The data model is schema-less
      • The data model is usually built to support queries
  • 48.
      • What are your basic units of data (what would be a document)?
      • How are these units grouped or related?
      • How does Mongo let you query this data, and what are the options?
      • Finally, and maybe most importantly, what are your application's access patterns?
          • Reads vs. writes
          • Queries
          • Updates
          • Deletions
          • How structured is it?
  • 49. Normalized
      • Similar to the relational model
      • One collection per entity type
      • Little or no redundancy
      • Allows clean updates; familiar to many SQL users; easier to understand
  • 51.
      • From parent to child: { name: "O'Reilly Media", books: [123456789, 234567890, ...] }
      • From child to parent: { _id: 123456789, title: "MongoDB: The Definitive Guide", publisher_id: "oreilly" }
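Because Mongo has no joins, the application follows these references itself. The Python sketch below shows the child-to-parent lookup and the reverse listing on in-memory dictionaries. The publisher's `_id` of "oreilly" and the second book are assumptions added so the example is complete; the slide only shows the fields listed above.

```python
# Parent documents, keyed by _id. The "_id": "oreilly" value is assumed here
# so that the child's publisher_id has something to point at.
publishers = {
    "oreilly": {"_id": "oreilly", "name": "O'Reilly Media"},
}

# Child documents that reference their parent through publisher_id.
books = [
    {"_id": 123456789, "title": "MongoDB: The Definitive Guide", "publisher_id": "oreilly"},
    {"_id": 234567890, "title": "Hypothetical Second Title", "publisher_id": "oreilly"},
]


def with_publisher(book: dict) -> dict:
    """Application-side join: attach the referenced publisher to a book."""
    return {**book, "publisher": publishers.get(book["publisher_id"])}


def titles_for(publisher_id: str) -> list:
    """Reverse direction: list the titles that reference one publisher."""
    return [b["title"] for b in books if b["publisher_id"] == publisher_id]


joined = with_publisher(books[0])
```

Each direction costs an extra lookup, which is the tradeoff against embedding the related data in one document.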
  • 52. An often-used pattern in Mongo is to embed information as subdocuments.
      • Used when there is a "contains" relationship
      • Easier querying (when related data is often used together)
      • Need to keep the 16 MB document size limit in mind
  • 54. Many or few collections
      • Many collections
          • As seen in normalized
          • Clean, with little redundancy
          • May not provide the best performance
          • May require frequent updates to the application if new types are added
      • Multiple collections
          • Middle ground, partially normalized
      • Not many collections
          • One large generic collection
          • Contains many types
          • Use a type field
  • 55.
      • Document growth: a document will be relocated if it exceeds its allocated size
      • Atomicity
          • Atomic at document level
          • A consideration for insertions, removes and multi-document updates
      • Sharding: collections are distributed across mongod instances using a shard key
      • Indexes: index fields that are often queried; indexes affect write performance slightly
      • Consider using TTL to automatically expire documents
  • 56.
      • CMS systems
      • Log collection
          • https://code.google.com/p/log4mongo/
      • Caching
      • Queues / messaging
          • Capped collections: fixed-size collections that support high-throughput operations that insert, retrieve, and delete documents based on insertion order
      • Analytics
      • Prototyping
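The behaviour of a capped collection (fixed size, insertion order, oldest entries dropped first) can be imitated with a bounded deque. This is an in-memory analogy only; the `CappedLog` class is hypothetical, and it caps by document count for simplicity rather than by storage size.

```python
from collections import deque


class CappedLog:
    """In-memory stand-in for a capped collection, capped by document count."""

    def __init__(self, max_documents: int):
        # A deque with maxlen silently discards the oldest item when full.
        self._docs = deque(maxlen=max_documents)

    def insert(self, doc: dict) -> None:
        self._docs.append(doc)

    def find(self) -> list:
        """Return the documents in insertion order, oldest first."""
        return list(self._docs)


log = CappedLog(max_documents=3)
for n in range(5):
    log.insert({"n": n})
recent = [doc["n"] for doc in log.find()]  # only the last three inserts remain
```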
  • 57.
      • Mongo Driver
          • Supplied by MongoDB itself
          • Easy to set up
          • Housed on the Maven repo
      • Morphia
          • Uses the app model
          • Handles references well
      • Spring Mongo
          • Great if using Spring already
  • 58.
      • Node
          • JavaScript (JSON), CoffeeScript
          • MEAN stack
      • Scala
          • Casbah
          • Reactive Mongo
  • 59. Get MEAN
      • Mongo, Express, Angular and Node
      • http://bitnami.com/stack/mean
      • http://mean.io
      • Can be installed in a VM or even in the cloud
  • 60. Database in the cloud
      • https://mongolab.com/
      • Can be accessed using the shell, a GUI Mongo explorer, mongoimport and mongoexport, and used in an application
      • Amazon, Rackspace, Joyent or Azure
  • 61.
      • MongoDB: The Definitive Guide, 2nd Edition. By Kristina Chodorow. O'Reilly Media, Inc., May 23, 2013. Print ISBN-13: 978-1-4493-4468-9. 432 pages.
      • MongoDB in Action. By Kyle Banker. Manning Publications, December 16, 2011. Print ISBN-10: 1-935182-87-0. Print ISBN-13: 978-1-935182-87-0. 312 pages.
      • The Definitive Guide to MongoDB: The NoSQL Database for Cloud and Desktop Computing. By Eelco Plugge, Peter Membrey and Tim Hawkins. Apress, September 2010. ISBN: 9781430230519. 327 pages.
  • 62.
      • MongoDB Applied Design Patterns. By Rick Copeland. O'Reilly Media, Inc., March 18, 2013. Print ISBN-13: 978-1-4493-4004-9. 176 pages.
      • MongoDB for Web Development (rough cut!). By Mitch Pirtle. Addison-Wesley Professional, last updated 14-JUN-2013, estimated publication March 11, 2015. Print ISBN-10: 0-321-70533-5. Print ISBN-13: 978-0-321-70533-4. 360 pages.
      • Instant MongoDB. By Amol Nayak. Packt Publishing, July 26, 2013. Print ISBN-13: 978-1-78216-970-3. 72 pages.
  • 63.
      • http://www.mongodb.org/
      • https://mongolab.com/welcome/
      • https://education.mongodb.com/
      • http://blog.mongodb.org/
      • http://stackoverflow.com/questions/tagged/mongodb
      • http://bitnami.com/stack/mean
  • 64. Let's look briefly at Cassandra as an alternative to Mongo
  • 65.
      • Developed at Facebook, based on Google Bigtable and Amazon Dynamo
      • Open sourced in mid 2008
      • Apache project March 2009
      • Commercial support through DataStax (originally known as Riptano, founded 2010)
      • Used at Netflix, eBay and many more. Reportedly the largest installation holds 300 TB on 400 machines
      • Current version is 2.0.3
  • 66.
      • No single point of failure: highly available
      • Peer to peer: no master
      • Data center aware: distributed architecture
      • Linear scaling: just add hardware
      • Eventual consistency, with a tunable tradeoff between latency and consistency
      • Architecture is optimized for writes
      • Can have 2 billion columns (cells)!
      • Data modeling for reads. Design starts with looking at your queries (sound familiar?)
      • With CQL it became more SQL-like, but with no joins, no subqueries and limited ordering (but very useful)
      • Column names can be part of the data, e.g. time series
  • 67. ** Important term ** Quorum: Q = N / 2 + 1 (with N / 2 rounded down).
      • We get consistency in a BASE world by satisfying W + R > N
      • 3 obvious ways:
          1. W = 1, R = N
          2. W = N, R = 1
          3. W = Q, R = Q
      • (N is the replication factor, R the read replica count, W the write replica count)
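The arithmetic on this slide is easy to check in code. The two helper functions below are written for this transcript (their names are not part of any Cassandra API) and simply encode Q = N / 2 + 1 and the W + R > N condition.

```python
def quorum(n: int) -> int:
    """Quorum size for replication factor n: N / 2 (rounded down) + 1."""
    return n // 2 + 1


def reads_see_latest_write(w: int, r: int, n: int) -> bool:
    """True when every read set must overlap every write set (W + R > N)."""
    return w + r > n


n = 3
q = quorum(n)
three_ways = [
    reads_see_latest_write(1, n, n),  # W = 1, R = N
    reads_see_latest_write(n, 1, n),  # W = N, R = 1
    reads_see_latest_write(q, q, n),  # W = Q, R = Q
]
```

With N = 3, writing to one replica and reading from one replica (W = 1, R = 1) does not satisfy the condition, which is why a read at that level may return stale data.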
  • 68. The C* data model is made of these:
      • Column: a name, a value and a timestamp. Applications can use the name as the data and not use the value. (Like a column in an RDBMS.)
      • Row: a collection of columns identified by a unique key. The key is called a partition key. (Like a row in an RDBMS.)
      • Column Family: a container for an ordered collection of rows. Each row is an ordered collection of columns. Each column has a key and maybe a value. (Like a table in an RDBMS.) This is also known as a table now in C* terms.
      • Keyspace: an administrative container for CFs. It is a namespace. It also has a replication strategy (more later). (Like a DB or schema in an RDBMS.)
  • 70.
      • Tokens: a partitioner-dependent element on the ring.
      • Each node has a single unique token assigned.
      • Each node claims a range of tokens, from the token of the previous node on the ring to its own token.
      • Use this formula: Initial_Token = Zero_Indexed_Node_Number * ((2^127) / Number_Of_Nodes)
      • In cassandra.yaml: initial_token: 42535295865117307932921825928971026432
      ** http://blog.milford.io/cassandra-token-calculator/
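The token formula can be evaluated directly. The short Python sketch below applies it to a 4-node cluster; the function name `initial_tokens` is made up for the example, and the 2^127 range is taken from the slide's formula (other partitioners use different ranges).

```python
def initial_tokens(number_of_nodes: int) -> list:
    """Apply Initial_Token = node_number * (2^127 / number_of_nodes) per node."""
    step = 2 ** 127 // number_of_nodes
    return [node * step for node in range(number_of_nodes)]


tokens = initial_tokens(4)
second_node_token = tokens[1]  # the value shown for cassandra.yaml on the slide
```

For four nodes, the second node (zero-indexed node 1) gets 42535295865117307932921825928971026432, matching the `initial_token` value quoted above.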
  • 71.
      • Replication is how many copies of each piece of data should be stored. In C* terms it is the Replication Factor or "RF".
      • In C*, RF is set at the keyspace level:
          CREATE KEYSPACE drg_compare WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};
      • How the data is replicated is called the Replication Strategy
          • SimpleStrategy: returns nodes "next" to each other on the ring; assumes a single DC
          • NetworkTopologyStrategy: for configuring per data center; rack and DC aware
          update keyspace UserProfile with strategy_options=[{DC1:3, DC2:3}];
  • 73. Using the token generation values from before, in a 4-node cluster: write a value with token 32535295865117307932921825928971026432
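Which node receives that write can be worked out from the ring. The sketch below assumes the four evenly spaced tokens from the earlier formula and a SimpleStrategy-like placement (the owner plus the next nodes clockwise); the function names are hypothetical and the replication factor of 3 is taken from the keyspace example.

```python
import bisect

# Tokens of a 4-node ring, node 0 to node 3, from the earlier formula.
RING = [node * (2 ** 127 // 4) for node in range(4)]


def owner(token: int) -> int:
    """Node whose range (previous token, own token] contains the token."""
    index = bisect.bisect_left(RING, token)
    return index % len(RING)  # tokens past the last node wrap to node 0


def replicas(token: int, replication_factor: int) -> list:
    """Owner plus the next nodes clockwise, as a simple placement would pick."""
    first = owner(token)
    return [(first + i) % len(RING) for i in range(replication_factor)]


write_token = 32535295865117307932921825928971026432
primary = owner(write_token)
copies = replicas(write_token, 3)
```

The token falls between node 0's token (0) and node 1's token, so zero-indexed node 1 owns it; with a replication factor of 3, nodes 1, 2 and 3 hold copies.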
  • 75.
      • When writing, a Coordinator Node is selected. It is selected at write (or read) time. Not a SPOF!
      • Using the Gossip Protocol, nodes share information with each other: who is up, who is down, who is taking which token ranges, etc. Every second, each node shares with 1 to 3 nodes.
      • Consistency Level (CL): says how many nodes must agree before an operation is a success. Set per read or write operation.
          • ONE: the coordinator waits for one node to ack the write (also TWO, THREE). ONE is the default if none is provided.
          • QUORUM: we saw that before, N / 2 + 1. Also LOCAL_QUORUM, EACH_QUORUM
          • ANY: waits for some replica. If all are down, it still succeeds. Only for writes. Doesn't guarantee the data can be read.
          • ALL: blocks waiting for all replicas
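The consistency levels translate into a required number of acknowledgements. The Python sketch below models that for a single data center; the function names and the `hint_stored` flag are inventions for the example, and LOCAL_QUORUM and EACH_QUORUM are left out because they depend on data center layout.

```python
def required_acks(level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements a consistency level waits for."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return replication_factor // 2 + 1
    if level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {level}")


def write_succeeds(level: str, replication_factor: int, acks: int,
                   hint_stored: bool = False) -> bool:
    """Decide whether a write counts as successful at the given level."""
    if level == "ANY":
        # ANY is satisfied by any replica ack, or by a stored hint alone.
        return acks >= 1 or hint_stored
    return acks >= required_acks(level, replication_factor)


quorum_ok = write_succeeds("QUORUM", 3, acks=2)
all_fails = write_succeeds("ALL", 3, acks=2)
any_with_hint = write_succeeds("ANY", 3, acks=0, hint_stored=True)
```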
  • 76. 3 important concepts:
      • Read Repair: at the time of a read, inconsistencies between nodes are noticed and replicas are updated. Direct and background; direct is determined by the CL.
      • Anti-Entropy Node Repair: for data that is not read frequently, or to update data on a node that has been down for a while, run the nodetool repair process (also called anti-entropy repair). It builds Merkle trees, compares nodes and does the repair.
      • Hinted Handoff: writes are always sent to all replicas for the specified row, regardless of the consistency level specified by the client. If a node happens to be down at the time of the write, its corresponding replicas will save hints about the missed writes, and then hand off the affected rows once the node comes back online. This notification happens via Gossip. Default: 1 hour.
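Read repair can be pictured as "newest timestamp wins, then bring the others up to date". The Python sketch below is a simplified model of that idea on one value per replica; the `read_repair` function and the node names are hypothetical, and real Cassandra reconciles per column rather than per whole value.

```python
def read_repair(replica_values: dict):
    """Pick the newest value among replicas and report which ones were stale.

    replica_values maps a node name to a (value, timestamp) pair.
    Returns the winning value, the stale node names, and the repaired state.
    """
    newest = max(replica_values.values(), key=lambda pair: pair[1])
    stale = [node for node, pair in replica_values.items() if pair != newest]
    repaired = {node: newest for node in replica_values}
    return newest[0], stale, repaired


before = {
    "node_a": ("v1", 1),
    "node_b": ("v2", 2),  # the most recent write only reached this replica
    "node_c": ("v1", 1),
}
value, stale_nodes, after = read_repair(before)
```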
  • 77.
      • Interaction with Cassandra can be done using one of the supplied clients, such as the CLI or CQL. Otherwise, client applications are built using a language client library.
      • There are many clients in multiple languages, including Java, .NET, Python, Scala, Go, PHP, Node.js, Perl, Ruby, etc.
      • Java:
          • Hector wraps the underlying Thrift API. Hector is one of the most commonly used client libraries.
          • Astyanax is a client library developed by Netflix.
          • Datastax CQL: the newest CQL driver; will be very familiar to JDBC developers
          • And many more … (JPA)
      • There are also Datastax OpsCenter, various other GUIs, and a REST API (Virgil)
  • 78.
      • Many more topics and information related to C* not covered
      • Great for fast writes
      • No single point of failure
      • Data center aware
      • Also relative ease of use
  • 79. Questions? Comments? Thank you! brian.enochson@gmail.com