No SQL Technologies


Published on

Introduction to NoSQL Technologies

Published in: Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

No SQL Technologies

  1. 1. What Should I Know about NoSQL? Cris J. Holdorph Software Architect Unicon, Inc. Jasig Conference Westminster, CO May 24, 2011© Copyright Unicon, Inc., 2008. Some rights reserved. This work is licensed under aCreative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.To view a copy of this license, visit
  2. 2. Lethal SQL 2
  3. 3. 3
  4. 4. Agenda1. Definitions2. History3. Projects4. Example Case Studies 4
  5. 5. Definitions 5
  6. 6. Definitions● RDBMS● SQL● CRUD● ACID – Atomicity, Consistency, Isolation, Durability● BASE – Basically Available, Soft state, Eventual consistency 6
  7. 7. 7
  8. 8. Definitions● Big Data● Sharding● Cloud Computing● Distributed File System● Key Value Store 8
  9. 9. History 9
  10. 10. Map Reduce● Patented software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.● Naming originally inspired by map and reduce functions of functional programming (but their purpose is not the same as it was there)● Map – The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes● Reduce – The master node then takes the answers to all the sub- problems and combines them in some way to get the output 10
  11. 11. What does NoSQL Stand For?● NoSQL● No SQL● Not SQL● Not Only SQL● Not the RDBMS● Wikipedia: – Carlo Strozzi used the term "NoSQL" in 1998 to name his lightweight, open-source relational database that did not expose an SQL interface. 11
  12. 12. History● Some techniques have existed for over 25 years● Teradata selling product for more then 20 years● RDBMS dates back to 1970 12
  13. 13. CAP Theorem● A conjecture made by Eric Brewer at the Symposium on Principles of Distributed Computing (2000)● States only possible to achieve 2 of 3 – Consistency (all nodes see the same data at the same time) – Availability (node failures do not prevent survivors from continuing to operate) – Partition Tolerance (the system continues to operate despite arbitrary message loss) 13
  14. 14. CAP● Consistent and Available – ACID systems, MySQL cluster, Oracle Coherence, Drizzle● Consistent and Partition Tolerance – SCLA (strongly consistent, loosely available) – HBase, Bigtable● Available and Partition Tolerant – BASE systems (CouchDB, SimpleDB, MongoDB● Cassandra (sits between SCLA/BASE systems) 14
  15. 15. Projects 15
  16. 16. Hadoop● Open-source software for reliable, scalable, distributed computing (Hadoop website) – Hadoop Common – HDFS – MapReduce● Created Initially in early 2006 to support search engine project Nutch● Inspired by the Google File System and MapReduce papers (Oct 2003) 16
  17. 17. Hadoop Related Projects● Hbase – A scalable, distributed database that supports structured data storage for large tables● Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying● Pig – A high-level data-flow language and execution framework for parallel computation● Cassandra – uses Hadoop for MapReduce 17
  18. 18. Who Uses Hadoop● EBay (532 nodes, Search optimization)● Facebook (1100x8 node cluster, 300x8 node cluster, more on this later)● GumGum (Ken Weiner, 20+ node cluster on Amazon EC2)● Hulu (log storage analysis)● (44x2 nodes log analysis, 20x2 nodes profile analysis)● LinkedIn (120x2x4 nodes, 520x2x4 nodes, "People you may know")● Twitter (more on this later)● Yahoo! (100,000 cpus running Hadoop, more on this later) 18
  19. 19. CouchDB● Apache open source document oriented database written in Erlang (concurrent programming lang)● Designed to scale horizontally● Stores documents (one or more field value pairs expressed as JSON)● ACID Semantics● Map/Reduce Views and Indexes (written in server side javascript)● Bi-direction replication (with conflict resolution)● REST API 19
  20. 20. 20
  21. 21. CouchDB Sample Document"Subject": "I like Plankton""Author": "Rusty""PostedDate": "5/23/2006""Tags": ["plankton", "baseball", "decisions"]"Body": "I decided today that I dont like baseball. Ilike plankton." 21
  22. 22. Who uses CouchDB?● Ubuntu One – cloud storage service –● "I Play WoW" facebook app –● Wego - travel site – 22
  23. 23. Cassandra● Fault Tolerant (replication, failed nodes can be replaced with no downtime)● Decentralized (ever node in cluster is identical, no bottlenicks)● Supports either Synchronous or Asynchronous update replication● Supports more then simple key/value pair● Elastic (read/write throughput increase linearly as machines are added)● Durable (suitable for applictions that cant 23 afford to lose data)
  24. 24. Cassandra● Initially developed by Facebook for Inbox Search (until replaced by HBase)● Key-value store where values can be multiple values● Some inspiration from Amazons Dynamo (another key-value store) 24
  25. 25. Who uses Cassandra?● Facebook (previously)● Twitter● Digg● Cisco 25
  26. 26. MongoDB● Name is derived from "humongous"● Document oriented database written in C++● Manages collections of JSON-like documents● Binaries available for windows, linux, OS X, Solaris● Supports dates, regular expressions code, binary data (all BSON types)● Cursors for query results● Any field can be queried at any time 26
  27. 27. MongoDB● Queries can include user-defined JavaScript functions● Master/Slave (only master supports writes, slaves can be read from)● Scales horizontally using sharding● Support for Map/Reduce 27
  28. 28. Who uses MongoDB?● New York Times● Shutterfly● Foursquare● SourceForge● Intuit 28
  29. 29. Google Big Table● Built on GFS (Google File System)● Can be used with Google App Engine● Maps two aribtrary strings and a timestamp● Designed to scale into the petabyte range● Designed to scale across hundreds or thousands of machines● Portions of a table (tablets) can be compressed● HBase was modeled after BigTable 29
  30. 30. Who uses Big Table?● Google Reader● Google Maps● Google Book Search● Google Earth●● Google Code● Orkut● YouTube● Gmail 30
  31. 31. Amazon SimpleDB● Written in Erlang● Used with Amazon EC2 and Amazon S3● Easy access to lookup and query functions● Without support for the less used complex database functions● Do not need to pre-define data formats that will be stored● Scalable (with size limitations) – 10gb per domain, up to 250 domains● Fast/Reliable● Supports eventually consistent read and consistent read● Potentially Inexpensive 31
  32. 32. SimpleDB Data Model 32
  33. 33. SimpleDB Data Model● Customer Account (amazon web services account)● Domains (similar to tables, or spreadsheet tabs)● Items (similar to rows)● Attributes (similar to columns)● Values (similar to cells) – Unlike a spreadsheet, however, multiple values can be associated with a cell● One domain can contain different types of data (some attributes not filled in) 33
  34. 34. SimpleDB API Summary● CreateDomain● DeleteDomain● ListDomains● PutAttributes● BatchPutAttributes● DeleteAttributes● BatchDeleteAttributes● GetAttributes● Select● DomainMetadata 34
  35. 35. Who uses SimpleDB?● Netflix● Other Amazon EC2 customers... 35
  36. 36. memcached● General purpose distributed memory caching system● Often used to cache in RAM that might otherwise be obtained from an external data source● LRU (when cache is full)● Can be distributed across multiple machines 36
  37. 37. Who uses memcached?● YouTube● Zynga● Facebook● Twitter 37
  38. 38. Terracotta● JVM in-memory distributed cache / store● The object store can be persistent● Distribution between nodes is handled through Terracotta server● Supports multiple Terracotta servers● Nodes only receive data they need/reference 38
  39. 39. Who uses Terracotta?● Sakai (thanks to John Wiley & Sons)● PartyGaming (● Adobe● Pearson 39
  40. 40. Example Case Studies 40
  41. 41. Yahoo!● Hadoop – – More than 100,000 CPUs in >36,000 computers running Hadoop – Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) – Used to support research for Ad Systems and Web Search – Also used to do scaling tests to support development of Hadoop on larger clusters – >60% of Hadoop Jobs within Yahoo are Pig jobs 41
  42. 42. Twitter● How Twitter Uses NoSQL –● Scribe – Syslog stopped scaling● Hadoop – Needs to store more data per day than it can reliably write to a single hard drive● Pig – Used for interacting with Hadoop● Hbase – People Search● FlockDB – Social Graph Analysis 42
  43. 43. Netflix ● NoSQL at Netflix – ● SimpleDB – Highly durable, with writes automatically replicated across availability zones within a region – Love it when others do heavy lifting for us● Hadoop/HBase – Convenient, high-performance column-oriented distributed database solution – HBase makes it really easy to grow your cluster and re-distribute load across nodes at runtime● Cassandra – Adding more servers, without the need to re-shard 43
  44. 44. Facebook●● 350 million users sending over 15 billion person-to-person messages per month● Chat service supports over 300 million users who send over 120 billion messages per month● Two patterns emerged – A short set of temporal data that tends to be volatile – An ever-growing set of data that rarely gets accessed● Evaluate clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems – MySQL proved to not handle the long tail of data well (as indexes/data grows large performance suffers – Cassandras eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure. 44
  45. 45. “There is a learning curve and anoperational overhead. Still, the scalability,availability and performance advantages ofthe NoSQL persistence model are evidentand are paying for themselves already, andwill be central to our long-term cloudstrategy.” Yury Izrailevsky, Netflix 45
  46. 46. Questions & Answers Cris J. Holdorph Software Architect Unicon, Inc. Twitter: @holdorph 46