A Practical Look at the NOSQL and Big Data Hullabaloo
Presentation Transcript

  • A Practical Look at the NOSQL and Big Data Hullabaloo • Andrew J. Brust, CEO and Founder, Blue Badge Insights • Sam Bisbee, Senior Doing Stuff Person, Cloudant (In Absentia) • Level: Intermediate
  • Meet Andrew • CEO and Founder, Blue Badge Insights • Big Data blogger for ZDNet • Microsoft Regional Director, MVP • Co-chair VSLive! and 17 years as a speaker • Founder, Microsoft BI User Group of NYC – http://www.msbinyc.com • Co-moderator, NYC .NET Developers Group – http://www.nycdotnetdev.com • “Redmond Review” columnist for Visual Studio Magazine and Redmond Developer News • brustblog.com, Twitter: @andrewbrust
  • My New Blog (bit.ly/bigondata)
  • Read all about it!
  • Meet Sam • Wait…you can’t. He’s not here. • Sam Bisbee – Director of Technical Business Development, Cloudant – He prefers “Senior Doing Stuff Person” (which is ironic) • I’ve preserved a few of his slides. • Look for: “From Sam” in upper-right-hand corner
  • Agenda • Why NoSQL? • NoSQL Definition(s) • Concepts • NoSQL Categories • Provisioning, market, applicability • Take-aways
  • Why NoSQL?
  • NoSQL Data Fodder • Addresses • Preferences • Documents • Friends, Followers • Notes
  • “Web Scale” • This is the term used to justify NoSQL • The scenario: simple needs per request, but “made up for in volume” – Millions of concurrent users • Think of sites like Amazon or Google • Think of non-transactional tasks like loading catalog data to display a product page, or environment preferences
  • [From Sam] What is NOSQL? • “Not Only SQL” - this is not a holy war • 1870: Modern study of set theory begins • 1970: Codd writes “A Relational Model of Data for Large Shared Data Banks” • 1970 – 1980: Commercial implementations of Codd’s theory are released
  • [From Sam] What is NOSQL? • 1970 - ~2000: the same sorts of databases were made (plus a few niche products) • The Dot-Com Bubble posed the same data tier problems but at a new scale (Amazon), forcing innovation out of necessity • 2000 – present: innovations are becoming open source and mainstream (Hadoop)
  • [From Sam] So What is NOSQL Really? • New ways of looking at dynamic data storage and querying for larger-scale systems (scale = concurrent users and data size)
  • NoSQL Common Traits • Non-relational • Non-schematized/schema-free • Open source • Distributed • Eventual consistency • “Web scale” • Developed at big Internet companies
  • Consistency • CAP Theorem – Databases may only excel at two of the following three attributes: consistency, availability and partition tolerance • NoSQL typically does not offer “ACID” guarantees – Atomicity, consistency, isolation and durability • Instead offers “eventual consistency” – Similar to DNS propagation
  • Consistency • Things like inventory, account balances should be consistent – Imagine updating a server in Seattle that stock was depleted – Imagine not updating the server in NY – Customer in NY goes to order 50 pieces of the item – Order processed even though no stock • Things like catalog information don’t have to be, at least not immediately – If a new item is entered into the catalog, it’s OK for some customers to see it even before the other customers’ servers know about it • But catalog info must come up quickly – Therefore don’t lock data in one location while waiting to update the other • Therefore, OK to sacrifice consistency for speed, in some cases
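The Seattle/NY inventory scenario on this slide can be sketched in a few lines of Python. This is a toy model under stated assumptions, not any real database's replication protocol: the `Replica` class, the `replicate` function and the `widget` item are hypothetical names invented for illustration.

```python
class Replica:
    """A toy replica holding its own copy of stock levels (hypothetical)."""

    def __init__(self, name, stock):
        self.name = name
        self.stock = dict(stock)

    def order(self, item, qty):
        # Accept or reject based only on the local, possibly stale, copy.
        if self.stock.get(item, 0) >= qty:
            self.stock[item] -= qty
            return True
        return False


def replicate(source, target):
    """Copy the source's state to the target (stands in for async replication)."""
    target.stock = dict(source.stock)


seattle = Replica("Seattle", {"widget": 50})
new_york = Replica("NY", {"widget": 50})

seattle.stock["widget"] = 0               # stock depleted, recorded in Seattle only
stale_ok = new_york.order("widget", 50)   # NY has not heard yet, so the order goes through

replicate(seattle, new_york)              # the replicas eventually converge
fresh_ok = new_york.order("widget", 50)   # now the order is rejected

print(stale_ok, fresh_ok)
```

The window between the Seattle update and the `replicate` call is where the bad order slips through. For catalog data that window is harmless, which is the trade the slide describes.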
  • CAP Theorem (diagram: Consistency, Availability and Partition Tolerance, with Relational and NoSQL placed among them)
  • Indexing • Most NoSQL databases are indexed by key • Some allow so-called “secondary” indexes • Often the primary key indexes are clustered • HBase uses the Hadoop Distributed File System (HDFS), which is append-only – Writes are logged – Logged writes are batched – File is re-created and sorted
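The three write steps on this slide (log the write, batch the logged writes, re-create the file in sorted order) can be sketched as below. This is a minimal illustration of the pattern only, not HBase's actual implementation; `TinyStore` and `flush_threshold` are hypothetical names.

```python
import bisect


class TinyStore:
    """Toy append-only store: log, batch, then rewrite a sorted file."""

    def __init__(self, flush_threshold=3):
        self.log = []            # append-only record of every write
        self.buffer = {}         # batched writes not yet in the sorted file
        self.sorted_file = []    # list of (key, value) pairs, sorted by key
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.log.append((key, value))    # 1. the write is logged
        self.buffer[key] = value         # 2. logged writes are batched
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. The file is re-created and sorted: merge old data with the batch.
        merged = dict(self.sorted_file)
        merged.update(self.buffer)
        self.sorted_file = sorted(merged.items())
        self.buffer = {}

    def get(self, key):
        if key in self.buffer:           # newest data first
            return self.buffer[key]
        keys = [k for k, _ in self.sorted_file]
        i = bisect.bisect_left(keys, key)    # binary search on the sorted keys
        if i < len(keys) and keys[i] == key:
            return self.sorted_file[i][1]
        return None


store = TinyStore()
for k, v in [("c", 3), ("a", 1), ("b", 2), ("d", 4)]:
    store.put(k, v)
```

Because the file is always sorted by key, lookups by primary key stay cheap, which is consistent with the slide's point that the primary key index is often the clustered one.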
  • Queries • Typically no query language • Instead, create procedural program • Sometimes SQL is supported • Sometimes MapReduce code is used…
  • MapReduce • Map step: pre-processes data • Reduce step: summarizes/aggregates data • Most typical of Hadoop and used with Wide Column Stores, esp. HBase • Amazon Web Services’ Elastic MapReduce (EMR) can read/write DynamoDB, S3 and Relational Database Service (RDS) • “Hive” offers a HiveQL (SQL-like) abstraction over MR – Use with Hive tables – Use with HBase
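To make the map and reduce steps concrete, here is a single-process sketch in plain Python. It assumes nothing about Hadoop's API; `run_mapreduce`, `map_step` and `reduce_step` are hypothetical names, and the order records are invented sample data.

```python
from collections import defaultdict

# Invented sample records, for illustration only.
orders = [
    {"customer": 101, "price": 300},
    {"customer": 202, "price": 2500},
    {"customer": 101, "price": 150},
]


def map_step(order):
    # Map: pre-process each record into (key, value) pairs.
    yield order["customer"], order["price"]


def reduce_step(key, values):
    # Reduce: summarize all values that share a key.
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)    # shuffle: group values by key
    return dict(reducer(key, values) for key, values in groups.items())


totals = run_mapreduce(orders, map_step, reduce_step)
print(totals)
```

In a real cluster the map calls run in parallel on the nodes that hold the data and the grouped values are shipped to reducers. The sketch keeps only the data flow.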
  • Sharding • A partitioning pattern where separate servers store partitions • Fan-out queries supported • Partitions may be duplicated, so replication also provided – Good for disaster recovery • Since “shards” can be geographically distributed, sharding can act like a CDN • Good for keeping data close to processing – Reduces network traffic when MapReduce splitting takes place
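A rough sketch of the sharding pattern: each key is routed to one partition, and a fan-out query asks every partition and merges the answers. Hash-based routing is one common choice and is an assumption here, since the slide does not say how keys are assigned. `ShardedStore` and its methods are hypothetical names, and each "server" is just an in-memory dict.

```python
import hashlib


class ShardedStore:
    """Toy sharded store; each shard stands in for a separate server."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # Stable hash of the key decides which shard owns it.
        digest = hashlib.sha256(str(key).encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, row):
        self._shard_for(key)[key] = row

    def get(self, key):
        # Single-key lookups touch exactly one shard.
        return self._shard_for(key).get(key)

    def fan_out(self, predicate):
        # Fan-out query: ask every shard, then merge the partial results.
        results = []
        for shard in self.shards:
            results.extend(row for row in shard.values() if predicate(row))
        return results


store = ShardedStore(shard_count=3)
store.put(101, {"name": "Andrew", "city": "New York"})
store.put(202, {"name": "Jane", "city": "Seattle"})
store.put(303, {"name": "Joe", "city": "New York"})

new_yorkers = sorted(row["name"] for row in store.fan_out(lambda r: r["city"] == "New York"))
```

Replication, which the slide also mentions, would mean writing each row to more than one shard. That is left out to keep the routing logic visible.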
  • Key-Value Stores • The most common; not necessarily the most popular • Has rows, each with something like a big dictionary/associative array – Schema may differ from row to row • Common on Cloud platforms – e.g. Amazon SimpleDB, Azure Table Storage • MemcacheDB, Voldemort, Couchbase • DynamoDB (AWS), Dynomite, Redis and Riak
  • Key-Value Stores • Database
    Table: Customers
      Row ID: 101 | First_Name: Andrew | Last_Name: Brust | Address: 123 Main Street | Last_Order: 1501
      Row ID: 202 | First_Name: Jane | Last_Name: Doe | Address: 321 Elm Street | Last_Order: 1502
    Table: Orders
      Row ID: 1501 | Price: 300 USD | Item1: 52134 | Item2: 24457
      Row ID: 1502 | Price: 2500 GBP | Item1: 98456 | Item2: 59428
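The two tables on this slide map naturally onto dictionaries of dictionaries. The sketch below uses the slide's own sample rows. The `Twitter` property and the `last_order` helper are invented for illustration, to show a schema that differs from row to row and a "join" done in application code rather than by the database.

```python
# Each table is a dict keyed by row ID; each row is its own dictionary.
customers = {
    101: {"First_Name": "Andrew", "Last_Name": "Brust",
          "Address": "123 Main Street", "Last_Order": 1501},
    202: {"First_Name": "Jane", "Last_Name": "Doe",
          "Address": "321 Elm Street", "Last_Order": 1502},
}
orders = {
    1501: {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
    1502: {"Price": "2500 GBP", "Item1": 98456, "Item2": 59428},
}

# Schema may differ from row to row: only row 202 gets this (invented) property.
customers[202]["Twitter"] = "@janedoe"


def last_order(customer_id):
    # No join in the store itself: the application follows the key by hand.
    return orders[customers[customer_id]["Last_Order"]]


print(last_order(101))
```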
  • Wide Column Stores • Have tables with declared column families – Each column family has “columns,” which are KV pairs that can vary from row to row • These are the most foundational for large sites – Big Table (Google) – HBase (originally part of the Yahoo-dominated Hadoop project) – Cassandra (Facebook): calls column families “super columns” and tables “super column families” • They are the most “Big Data”-ready – Especially HBase + Hadoop
  • Wide Column Stores
    Table: Customers
      Row ID: 101
        Super Column: Name | Column: First_Name: Andrew | Column: Last_Name: Brust
        Super Column: Address | Column: Number: 123 | Column: Street: Main Street
        Super Column: Orders | Column: Last_Order: 1501
      Row ID: 202
        Super Column: Name | Column: First_Name: Jane | Column: Last_Name: Doe
        Super Column: Address | Column: Number: 321 | Column: Street: Elm Street
        Super Column: Orders | Column: Last_Order: 1502
    Table: Orders
      Row ID: 1501
        Super Column: Pricing | Column: Price: 300 USD
        Super Column: Items | Column: Item1: 52134 | Column: Item2: 24457
      Row ID: 1502
        Super Column: Pricing | Column: Price: 2500 GBP
        Super Column: Items | Column: Item1: 98456 | Column: Item2: 59428
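The extra level of nesting is what separates this layout from a plain key-value row: row, then column family (the slide's "super column"), then column. A small sketch using the slide's Customers data follows. `scan_family` is a hypothetical helper and the `Previous_Order` column is invented to show columns varying from row to row within a declared family.

```python
# row ID -> column family ("super column" on the slide) -> column -> value
customers = {
    101: {
        "Name": {"First_Name": "Andrew", "Last_Name": "Brust"},
        "Address": {"Number": 123, "Street": "Main Street"},
        "Orders": {"Last_Order": 1501},
    },
    202: {
        "Name": {"First_Name": "Jane", "Last_Name": "Doe"},
        "Address": {"Number": 321, "Street": "Elm Street"},
        "Orders": {"Last_Order": 1502},
    },
}


def scan_family(table, family):
    """Return {row_id: columns} for one column family across all rows."""
    return {row_id: row[family] for row_id, row in table.items() if family in row}


# Families are declared up front, but the columns inside one may vary per row.
customers[101]["Orders"]["Previous_Order"] = 1400   # invented column

names = scan_family(customers, "Name")
order_columns = {row_id: sorted(cols)
                 for row_id, cols in scan_family(customers, "Orders").items()}
```

Reading one family across many rows without touching the others is the access pattern this layout is built around.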
  • Wide Column Stores
  • Document Stores • Have “databases,” which are akin to tables • Have “documents,” akin to rows – Documents are typically JSON objects – Each document has properties and values – Values can be scalars, arrays, links to documents in other databases or sub-documents (i.e. contained JSON objects, allowing for hierarchical storage) – Can have attachments as well • Old versions are retained – So doc stores work well for content management • Some view doc stores as specialized KV stores • Most popular with developers, startups, VCs • The biggies: – CouchDB (and derivatives) – MongoDB
  • Document Store Application Orientation • Documents can each be addressed by URIs • CouchDB supports a full REST interface • Very geared towards JavaScript and JSON – Documents are JSON objects – CouchDB/MongoDB use JavaScript as native language • In CouchDB, “view functions” also have unique URIs and they return HTML – So you can build entire applications in the database
  • Document Stores
    Database: Customers
      Document ID: 101 | First_Name: Andrew | Last_Name: Brust | Address: (Number: 123, Street: Main Street) | Orders: (Most_recent: 1501)
      Document ID: 202 | First_Name: Jane | Last_Name: Doe | Address: (Number: 321, Street: Elm Street) | Orders: (Most_recent: 1502)
    Database: Orders
      Document ID: 1501 | Price: 300 USD | Item1: 52134 | Item2: 24457
      Document ID: 1502 | Price: 2500 GBP | Item1: 98456 | Item2: 59428
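The documents on this slide are hierarchical: `Address` and `Orders` are sub-documents, and `Most_recent` is a link into the Orders database. The sketch below models that with the standard `json` module only. It does not use the CouchDB or MongoDB client APIs; the in-memory "databases" and the `most_recent_order` helper are assumptions made for illustration.

```python
import json

# Two "databases", each mapping a document ID to a JSON-style document.
customers_db = {
    "101": {
        "First_Name": "Andrew",
        "Last_Name": "Brust",
        "Address": {"Number": 123, "Street": "Main Street"},   # sub-document
        "Orders": {"Most_recent": "1501"},                     # link to another database
    },
}
orders_db = {
    "1501": {"Price": "300 USD", "Item1": 52134, "Item2": 24457},
}


def most_recent_order(customer_id):
    # Follow the link stored in the customer document.
    link = customers_db[customer_id]["Orders"]["Most_recent"]
    return orders_db[link]


# A document is just JSON, so it survives a round trip unchanged.
as_json = json.dumps(customers_db["101"], sort_keys=True)
round_tripped = json.loads(as_json)
street = round_tripped["Address"]["Street"]
```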
  • Document Stores
  • Graph Databases • Great for social network applications and others where relationships are important • Nodes and edges – Edges are like joins – Nodes are like rows in a table • Nodes can also have properties and values • Neo4j is a popular graph db
  • Graph Databases (diagram) • Person nodes: George Washington, Andrew Brust, Joe Smith, Jane Doe • Address node: Street: 123 Main Street, City: New York, State: NY, Zip: 10014 • Order node: ID: 252, Total Price: 300 USD • Item nodes: (ID: 52134, Type: Dress, Color: Blue), (ID: 24457, Type: Shirt, Color: Red) • Edge labels: Friend of, Address, Placed order, Item1, Item2, Commented on photo by, Sent invitation to
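A graph like the one on this slide can be sketched as nodes with properties plus labeled edges. The transcript does not preserve which nodes each edge connects, so the wiring below (Andrew Brust has the address and placed order 252, which contains both items) is an assumption, and the node keys and helper functions are hypothetical. Nothing here uses Neo4j's API.

```python
# Nodes carry properties and values, as the slide notes.
nodes = {
    "andrew": {"Name": "Andrew Brust"},
    "address": {"Street": "123 Main Street", "City": "New York",
                "State": "NY", "Zip": "10014"},
    "order252": {"ID": 252, "Total Price": "300 USD"},
    "item52134": {"ID": 52134, "Type": "Dress", "Color": "Blue"},
    "item24457": {"ID": 24457, "Type": "Shirt", "Color": "Red"},
}

# Edges are (source, label, target) triples. The wiring is assumed.
edges = [
    ("andrew", "Address", "address"),
    ("andrew", "Placed order", "order252"),
    ("order252", "Item1", "item52134"),
    ("order252", "Item2", "item24457"),
]


def neighbors(node_id, label=None):
    """Follow outgoing edges from a node, optionally filtered by label."""
    return [dst for src, lbl, dst in edges
            if src == node_id and (label is None or lbl == label)]


def item_types_ordered_by(person_id):
    # Two hops: person -> orders -> items. In SQL this would be two joins.
    types = []
    for order_id in neighbors(person_id, "Placed order"):
        for item_id in neighbors(order_id):
            types.append(nodes[item_id]["Type"])
    return types


print(item_types_ordered_by("andrew"))
```

Each hop is a direct edge lookup rather than a join over whole tables, which is why the slide calls graph databases a good fit where relationships matter.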
  • NoSQL on Windows Azure • Platform as a Service – Cloudant: https://cloudant.com/azure/ – MongoDB (via MongoLab): http://blog.mongolab.com/2012/10/azure/ • MongoDB, DIY: – On an Azure Worker Role: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+Worker+Roles – On a Windows VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Windows+Installer – On a Linux VM: http://www.mongodb.org/display/DOCS/MongoDB+on+Azure+VM+-+Linux+Tutorial and http://www.windowsazure.com/en-us/manage/linux/common-tasks/mongodb-on-a-linux-vm/
  • NoSQL on Windows Azure • Others, DIY (Linux VMs): – Couchbase: http://blog.couchbase.com/couchbase-server-new-windows-azure – CouchDB: http://ossonazure.interoperabilitybridges.com/articles/couchdb-installer-for-windows-azure – Riak: http://basho.com/blog/technical/2012/10/09/Riak-on-Microsoft-Azure/ – Redis: http://blogs.msdn.com/b/tconte/archive/2012/06/08/running-redis-on-a-centos-linux-vm-in-windows-azure.aspx – Cassandra: http://www.windowsazure.com/en-us/manage/linux/other-resources/how-to-run-cassandra-with-linux/
  • [From Sam] The High-Level Shake Out • Hadoop will continue to crush data warehousing • MongoDB will be the top MySQL / on-prem alternative • Cloudant will be the top as-a-Service / Cloud database • Basho is pivoting toward cloud object store
  • NoSQL + BI • NoSQL databases are bad for ad hoc query and data warehousing • BI applications involve models; models rely on schema • Extract, transform and load (ETL) may be your friend • Wide-column stores, however, are good for “Big Data” – See next slide • Wide-column stores and column-oriented databases are similar technologically
  • NoSQL + Big Data • Big Data and NoSQL are interrelated • Typically, wide-column stores are used in Big Data scenarios • Prime example: – HBase and Hadoop • Why? – Lack of indexing not a problem – Consistency not an issue – Fast reads very important – Distributed file systems important too – Commodity hardware and disk assumptions also important – Not Web scale but massive scale-out, so similar concerns
  • Compromises • Eventual consistency • Write buffering • Only primary keys can be indexed • Queries must be written as programs • Tooling – Productivity (= money)
  • Summing Up • Line of Business -> Relational • Large, public (consumer)-facing sites -> NoSQL • Complex data structures -> Relational • Big Data -> NoSQL • Transactional -> Relational • Content Management -> NoSQL • Enterprise -> Relational • Consumer Web -> NoSQL
  • Thank you • andrew.brust@bluebadgeinsights.com • @andrewbrust on twitter • Want to get the free “Redmond Roundup Plus?” Text “bluebadge” to 22828