A Practical Look at the NOSQL and Big Data Hullabaloo

  • 1. A Practical Look at the NOSQL and Big Data Hullabaloo • Andrew J. Brust, CEO and Founder, Blue Badge Insights • Sam Bisbee, Senior Doing Stuff Person, Cloudant (In Absentia) • Level: Intermediate
  • 2. Meet Andrew • CEO and Founder, Blue Badge Insights • Big Data blogger for ZDNet • Microsoft Regional Director, MVP • Co-chair VSLive! and 17 years as a speaker • Founder, Microsoft BI User Group of NYC • Co-moderator, NYC .NET Developers Group • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News • Twitter: @andrewbrust
  • 3. My New Blog (
  • 4. Read all about it!
  • 5. Meet Sam • Wait… you can’t. He’s not here. • Sam Bisbee – Director of Technical Business Development, Cloudant – He prefers “Senior Doing Stuff Person”, which is ironic • I’ve preserved a few of his slides. – Look for “From Sam” in the upper-right-hand corner
  • 6. Agenda • Why NoSQL? • NoSQL Definition(s) • Concepts • NoSQL Categories • Provisioning, market, applicability • Take-aways
  • 7. Why NoSQL?
  • 8. NoSQL Data Fodder • Addresses • Preferences • Documents • Friends, Followers • Notes
  • 9. “Web Scale” • This is the term used to justify NoSQL • The scenario: simple needs, but “made up for in volume” – Millions of concurrent users • Think of sites like Amazon or Google • Think of non-transactional tasks like loading catalog data to display a product page, or environment preferences
  • 11. What is NOSQL? (From Sam) • “Not Only SQL”: this is not a holy war • 1870: Modern study of set theory begins • 1970: Codd writes “A Relational Model of Data for Large Shared Data Banks” • 1970 – 1980: Commercial implementations of Codd’s theory are released
  • 12. What is NOSQL? (From Sam) • 1970 – ~2000: the same sorts of databases were made (plus a few niche products) • The Dot-Com Bubble brought the same data tier problems but at a new scale (Amazon), forcing innovation out of necessity • 2000 – present: innovations are becoming open source and “mainstream” (Hadoop)
  • 13. So What is NOSQL Really? (From Sam) • New ways of looking at dynamic data storage and querying for larger scale systems (scale = concurrent users and data size)
  • 14. NoSQL Common Traits • Non-relational • Non-schematized/schema-free • Open source • Distributed • Eventual consistency • “Web scale” • Developed at big Internet companies
  • 15. CONCEPTS
  • 16. Consistency • CAP Theorem – Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance • NoSQL does not offer “ACID” guarantees – Atomicity, consistency, isolation and durability • Instead offers “eventual consistency” – Similar to DNS propagation
  • 17. Consistency • Things like inventory and account balances should be consistent – Imagine updating a server in Seattle to show that stock was depleted – Imagine not updating the server in NY – A customer in NY goes to order 50 pieces of the item – The order is processed even though there is no stock • Things like catalog information don’t have to be, at least not immediately – If a new item is entered into the catalog, it’s OK for some customers to see it before the other customers’ servers know about it • But catalog info must come up quickly – Therefore don’t lock data in one location while waiting to update the other • Therefore, it is OK to sacrifice consistency for speed, in some cases
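
The Seattle/NY scenario on the slide above can be sketched in a few lines of Python. This is a toy model, not how any particular NoSQL product replicates data: the `Replica` and `EventuallyConsistentStore` classes, the `stock:widget` key and the explicit `sync()` call are all assumptions made for illustration.

```python
class Replica:
    """One server holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class EventuallyConsistentStore:
    """Accepts a write on one replica and propagates it to the others later."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.pending = []  # (target replica, key, value) not yet propagated

    def write(self, replica, key, value):
        # The write succeeds immediately on the local replica (no global lock).
        replica.data[key] = value
        for other in self.replicas:
            if other is not replica:
                self.pending.append((other, key, value))

    def sync(self):
        # Later, the queued updates reach the remaining replicas.
        for target, key, value in self.pending:
            target.data[key] = value
        self.pending.clear()


seattle = Replica("Seattle")
new_york = Replica("New York")
for replica in (seattle, new_york):
    replica.data["stock:widget"] = 50

store = EventuallyConsistentStore([seattle, new_york])
store.write(seattle, "stock:widget", 0)   # stock depleted in Seattle

stale = new_york.data["stock:widget"]     # NY has not heard yet
store.sync()
fresh = new_york.data["stock:widget"]     # NY has caught up
```

Between `write` and `sync`, a New York customer would still see 50 units in stock, which is harmless for catalog data and a real problem for inventory.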
  • 18. CAP Theorem • [Diagram: Consistency, Availability and Partition Tolerance, with Relational and NoSQL databases positioned among them]
  • 19. Indexing • Most NoSQL databases are indexed by key • Some allow so-called “secondary” indexes • Often the primary key indexes are clustered • HBase uses Hadoop Distributed File System, which is append-only – Writes are logged – Logged writes are batched – File is re-created and sorted
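
The log, batch and re-sort write path listed on the slide above can be illustrated with a small in-memory sketch. It is a simplified model under stated assumptions, not HBase's actual implementation: the `AppendOnlyTable` class, its `batch_size` parameter and the in-memory "file" are hypothetical.

```python
import bisect


class AppendOnlyTable:
    """Toy write path: log writes, batch them, then rebuild a key-sorted file."""

    def __init__(self, batch_size=3):
        self.batch_size = batch_size
        self.log = []          # recent writes, in arrival order
        self.sorted_file = []  # (key, value) pairs sorted by key, never edited in place

    def put(self, key, value):
        self.log.append((key, value))      # writes are logged
        if len(self.log) >= self.batch_size:
            self.flush()                   # logged writes are batched

    def flush(self):
        merged = dict(self.sorted_file)
        merged.update(self.log)            # newer writes win over older ones
        self.sorted_file = sorted(merged.items())  # file is re-created and sorted
        self.log = []

    def get(self, key):
        # Check unflushed writes first, newest first.
        for logged_key, value in reversed(self.log):
            if logged_key == key:
                return value
        # Then binary-search the sorted file, which acts as a clustered key index.
        keys = [k for k, _ in self.sorted_file]
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return self.sorted_file[i][1]
        return None
```

Because the sorted file is only ever replaced, never modified, this shape suits an append-only file system; the cost is that reads must consult both the log and the file.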
  • 20. Queries • Typically no query language • Instead, create a procedural program • Sometimes SQL is supported • Sometimes MapReduce code is used…
  • 21. MapReduce • Map step: pre-processes data • Reduce step: summarizes/aggregates data • Most typical of Hadoop and used with Wide Column Stores, esp. HBase • Amazon Web Services’ Elastic MapReduce (EMR) can read/write DynamoDB, S3, Relational Database Service (RDS) • “Hive” offers a HiveQL (SQL-like) abstraction over MR – Use with Hive tables – Use with HBase
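
To make the map and reduce steps above concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop or any real MapReduce framework; the function names, the order records and order 1503 are assumptions made for illustration.

```python
from collections import defaultdict


def map_step(order):
    # Map: pre-process each record into (key, value) pairs.
    yield order["currency"], order["price"]


def reduce_step(key, values):
    # Reduce: summarize all values that share a key.
    return key, sum(values)


def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)   # the "shuffle": group values by key
    return dict(reducer(key, values) for key, values in groups.items())


orders = [
    {"id": 1501, "price": 300, "currency": "USD"},
    {"id": 1502, "price": 2500, "currency": "GBP"},
    {"id": 1503, "price": 200, "currency": "USD"},  # hypothetical extra order
]

totals = map_reduce(orders, map_step, reduce_step)
```

In a real cluster the map calls run in parallel on the machines that hold the data, and only the grouped intermediate pairs travel over the network to the reducers.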
  • 22. Sharding • A partitioning pattern where separate servers store partitions • Fan-out queries supported • Partitions may be duplicated, so replication also provided – Good for disaster recovery • Since “shards” can be geographically distributed, sharding can act like a CDN • Good for keeping data close to processing – Reduces network traffic when MapReduce splitting takes place
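
A minimal sketch of the sharding pattern above, assuming hash-based placement (the slide does not say how keys are assigned to shards, and range-based schemes are also common). The `ShardedStore` class and its methods are hypothetical, and each "server" is just a dictionary.

```python
import hashlib


class ShardedStore:
    """Spreads rows across several shards and supports fan-out queries."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]  # one dict per "server"

    def _shard_for(self, key):
        # A stable hash, so the same key always lands on the same shard.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, row):
        self._shard_for(key)[key] = row

    def get(self, key):
        # A lookup by key touches exactly one shard.
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        # A query on anything other than the key must ask every shard.
        results = []
        for shard in self.shards:
            results.extend(row for row in shard.values() if predicate(row))
        return results
```

For example, `store.get(101)` consults a single shard, while `store.fan_out(lambda row: row["Last_Name"] == "Doe")` visits all of them and merges the answers.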
  • 24. Key-Value Stores • The most common; not necessarily the most popular • Has rows, each with something like a big dictionary/associative array – Schema may differ from row to row • Common on Cloud platforms – e.g. Amazon SimpleDB, Azure Table Storage • MemcacheDB, Voldemort, Couchbase • DynamoDB (AWS), Dynomite, Redis and Riak
  • 25. Key-Value Stores (example database) • Table: Customers – Row ID: 101 (First_Name: Andrew; Last_Name: Brust; Address: 123 Main Street; Last_Order: 1501) – Row ID: 202 (First_Name: Jane; Last_Name: Doe; Address: 321 Elm Street; Last_Order: 1502) • Table: Orders – Row ID: 1501 (Price: 300 USD; Item1: 52134; Item2: 24457) – Row ID: 1502 (Price: 2500 GBP; Item1: 98456; Item2: 59428)
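
The example tables above map naturally onto nested dictionaries. The sketch below uses plain Python dictionaries rather than any real key-value product; the `Loyalty_Tier` field and the `last_order` helper are hypothetical additions that show per-row schema differences and an application-side lookup.

```python
# Each table maps a row ID to a dictionary of that row's keys and values.
customers = {
    101: {"First_Name": "Andrew", "Last_Name": "Brust",
          "Address": "123 Main Street", "Last_Order": 1501},
    202: {"First_Name": "Jane", "Last_Name": "Doe",
          "Address": "321 Elm Street", "Last_Order": 1502},
}
orders = {
    1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
    1502: {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
}

# Schema may differ from row to row: only one customer gets this field.
customers[202]["Loyalty_Tier"] = "Gold"  # hypothetical field


def last_order(customer_id):
    # There is no join: the application follows the reference itself.
    customer = customers[customer_id]
    return orders[customer["Last_Order"]]
```

Calling `last_order(101)` performs two key lookups in application code, which is the kind of "query written as a program" that the deck lists among the compromises.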
  • 26. Wide Column Stores • Has tables with declared column families – Each column family has “columns”, which are KV pairs that can vary from row to row • These are the most foundational for large sites – Big Table (Google) – HBase (originally part of the Yahoo-dominated Hadoop project) – Cassandra (Facebook), which calls column families “super columns” and tables “super column families” • They are the most “Big Data”-ready – Especially HBase + Hadoop
  • 27. Wide Column Stores (example) • Table: Customers – Row ID: 101 (Super Column: Name [Column: First_Name: Andrew; Column: Last_Name: Brust]; Super Column: Address [Column: Number: 123; Column: Street: Main Street]; Super Column: Orders [Column: Last_Order: 1501]) – Row ID: 202 (Super Column: Name [Column: First_Name: Jane; Column: Last_Name: Doe]; Super Column: Address [Column: Number: 321; Column: Street: Elm Street]; Super Column: Orders [Column: Last_Order: 1502]) • Table: Orders – Row ID: 1501 (Super Column: Pricing [Column: Price: 300 USD]; Super Column: Items [Column: Item1: 52134; Column: Item2: 24457]) – Row ID: 1502 (Super Column: Pricing [Column: Price: 2500 GBP]; Super Column: Items [Column: Item1: 98456; Column: Item2: 59428])
  • 28. Wide Column Stores
  • 29. Document Stores • Have “databases,” which are akin to tables • Have “documents,” akin to rows – Documents are typically JSON objects – Each document has properties and values – Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained JSON objects, which allows for hierarchical storage) – Can have attachments as well • Old versions are retained – So Doc Stores work well for content management • Some view doc stores as specialized KV stores • Most popular with developers, startups, VCs • The biggies: – CouchDB – Derivatives – MongoDB
  • 30. Document Store Application Orientation • Documents can each be addressed by URIs • CouchDB supports full REST interface • Very geared towards JavaScript and JSON – Documents are JSON objects – CouchDB/MongoDB use JavaScript as native language • In CouchDB, “view functions” also have unique URIs and they return HTML – So you can build entire applications in the database
  • 31. Document Stores (example) • Database: Customers – Document ID: 101 (First_Name: Andrew; Last_Name: Brust; Address: [Number: 123; Street: Main Street]; Orders: [Most_recent: 1501]) – Document ID: 202 (First_Name: Jane; Last_Name: Doe; Address: [Number: 321; Street: Elm Street]; Orders: [Most_recent: 1502]) • Database: Orders – Document ID: 1501 (Price: 300 USD; Item1: 52134; Item2: 24457) – Document ID: 1502 (Price: 2500 GBP; Item1: 98456; Item2: 59428)
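
The customer document above can be written as a JSON object with sub-documents. The sketch below uses only Python's standard `json` module and no database client; the `_id` field name follows the convention CouchDB and MongoDB use for document IDs, and the variable names are assumptions for illustration.

```python
import json

# One customer document; Address and Orders are sub-documents (nested objects).
customer_doc = {
    "_id": "101",
    "First_Name": "Andrew",
    "Last_Name": "Brust",
    "Address": {"Number": 123, "Street": "Main Street"},
    "Orders": {"Most_recent": "1501"},  # a link to a document in the Orders database
}

serialized = json.dumps(customer_doc)   # what would travel over a REST interface
restored = json.loads(serialized)

# Hierarchical storage: nested values are reached by walking the document.
street = restored["Address"]["Street"]
```

Unlike the key-value example, the address here is structured data inside the document rather than a single string, so the application can read one field of it without parsing.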
  • 32. Document Stores
  • 33. Graph Databases • Great for social network applications and others where relationships are important • Nodes and edges – Edge like a join – Nodes like rows in a table • Nodes can also have properties and values • Neo4j is a popular graph db
  • 34. Graph Databases • [Diagram of an example graph database. Person nodes: George Washington, Andrew Brust, Joe Smith, Jane Doe. Address node: Street: 123 Main Street; City: New York; State: NY; Zip: 10014. Order node: ID: 252; Total Price: 300 USD. Item nodes: ID: 52134, Type: Dress, Color: Blue; ID: 24457, Type: Shirt, Color: Red. Edge labels: Friend of, Address, Placed order, Item1, Item2, Commented on photo by, Sent invitation to]
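
A small sketch of nodes, properties and labeled edges, using plain Python structures rather than Neo4j or any graph query language. The node names and edge labels come from the slide, but the transcript does not preserve which nodes each edge connects, so the specific connections below are assumptions, as are the `neighbors` and `items_ordered_by` helpers.

```python
# Nodes carry properties, much like rows in a table.
nodes = {
    "Andrew Brust": {},
    "George Washington": {},
    "Order 252": {"Total Price": "300 USD"},
    "Item 52134": {"Type": "Dress", "Color": "Blue"},
    "Item 24457": {"Type": "Shirt", "Color": "Red"},
}

# Edges are (source, label, target) triples, playing the role of joins.
# ASSUMPTION: these particular connections are illustrative only.
edges = [
    ("Andrew Brust", "Friend of", "George Washington"),
    ("Andrew Brust", "Placed order", "Order 252"),
    ("Order 252", "Item1", "Item 52134"),
    ("Order 252", "Item2", "Item 24457"),
]


def neighbors(source, label=None):
    """Follow outgoing edges from a node, optionally filtered by edge label."""
    return [target for s, edge_label, target in edges
            if s == source and (label is None or edge_label == label)]


def items_ordered_by(person):
    """Two-hop traversal: person -> orders -> items."""
    items = []
    for order in neighbors(person, "Placed order"):
        items.extend(neighbors(order))
    return items
```

A traversal such as `items_ordered_by("Andrew Brust")` walks relationships directly instead of joining tables, which is why graph stores fit social-network style data.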
  • 36. NoSQL on Windows Azure • Platform as a Service – Cloudant: – MongoDB (via MongoLab): • MongoDB, DIY: – On an Azure Worker Role: e+Worker+Roles – On a Windows VM: e+VM+-+Windows+Installer – On a Linux VM: e+VM+-+Linux+Tutorial us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
  • 37. NoSQL on Windows Azure • Others, DIY (Linux VMs): – Couchbase: new-windows-azure – CouchDB: db-installer-for-windows-azure – Riak: Microsoft-Azure/ – Redis: redis-on-a-centos-linux-vm-in-windows-azure.aspx – Cassandra: us/manage/linux/other-resources/how-to-run-cassandra- with-linux/
  • 38. The High-Level Shake Out (From Sam) • Hadoop will continue to crush data warehousing • MongoDB will be the top MySQL / on-prem alternative • Cloudant will be the top as-a-Service / Cloud database • Basho is pivoting toward cloud object store
  • 39. NoSQL + BI • NoSQL databases are bad for ad hoc query and data warehousing • BI applications involve models; models rely on schema • Extract, transform and load (ETL) may be your friend • Wide-column stores, however, are good for “Big Data” – See next slide • Wide-column stores and column-oriented databases are similar technologically
  • 40. NoSQL + Big Data • Big Data and NoSQL are interrelated • Typically, Wide-Column stores are used in Big Data scenarios • Prime example: – HBase and Hadoop • Why? – Lack of indexing not a problem – Consistency not an issue – Fast reads very important – Distributed file systems important too – Commodity hardware and disk assumptions also important – Not Web scale but massive scale-out, so similar concerns
  • 41. TAKE-AWAYS
  • 42. Compromises • Eventual consistency • Write buffering • Only primary keys can be indexed • Queries must be written as programs • Tooling – Productivity (= money)
  • 43. Summing Up • Line of Business -> Relational • Large, public (consumer)-facing sites -> NoSQL • Complex data structures -> Relational • Big Data -> NoSQL • Transactional -> Relational • Content Management -> NoSQL • Enterprise -> Relational • Consumer Web -> NoSQL
  • 44. Thank you • @andrewbrust on Twitter • Want to get the free “Redmond Roundup Plus”? Text “bluebadge” to 22828