No SQL Technologies

Introduction to NoSQL Technologies


  1. What Should I Know about NoSQL?
     Cris J. Holdorph, Software Architect, Unicon, Inc.
     Jasig Conference, Westminster, CO, May 24, 2011
     © Copyright Unicon, Inc., 2008. Some rights reserved. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/
  2. Lethal SQL
  3. (no text on this slide)
  4. Agenda
     1. Definitions
     2. History
     3. Projects
     4. Example Case Studies
  5. Definitions
  6. Definitions
     ● RDBMS
     ● SQL
     ● CRUD
     ● ACID
       – Atomicity, Consistency, Isolation, Durability
     ● BASE
       – Basically Available, Soft state, Eventual consistency
  7. (no text on this slide)
  8. Definitions
     ● Big Data
     ● Sharding
     ● Cloud Computing
     ● Distributed File System
     ● Key Value Store
  9. History
  10. MapReduce
     ● Patented software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.
     ● Naming originally inspired by the map and reduce functions of functional programming (but their purpose here is not the same as it was there)
     ● Map
       – The master node takes the input, partitions it into smaller sub-problems, and distributes those to worker nodes
     ● Reduce
       – The master node then takes the answers to all the sub-problems and combines them in some way to get the output
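The map and reduce steps described on this slide can be illustrated with a single-process word count. This is only a sketch: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, nothing is actually distributed, and a real framework such as Hadoop would run each chunk on a different worker node.

```python
from collections import defaultdict
from typing import Iterable


def map_phase(chunk: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key so each key is reduced once."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the partial answers for one key."""
    return key, sum(values)


def word_count(chunks: list[str]) -> dict[str, int]:
    # Each chunk stands in for the sub-problem handed to one worker node.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["I like plankton", "I like baseball"])
```

Here `counts` maps each lowercased word to the number of times it appears across both chunks.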
  11. What does NoSQL Stand For?
     ● NoSQL
     ● No SQL
     ● Not SQL
     ● Not Only SQL
     ● Not the RDBMS
     ● Wikipedia:
       – Carlo Strozzi used the term "NoSQL" in 1998 to name his lightweight, open-source relational database that did not expose an SQL interface.
  12. History
     ● Some techniques have existed for over 25 years
     ● Teradata has been selling its product for more than 20 years
     ● RDBMS dates back to 1970
  13. CAP Theorem
     ● A conjecture made by Eric Brewer at the Symposium on Principles of Distributed Computing (2000)
     ● States that a distributed system can achieve only 2 of these 3 properties at once
       – Consistency (all nodes see the same data at the same time)
       – Availability (node failures do not prevent survivors from continuing to operate)
       – Partition Tolerance (the system continues to operate despite arbitrary message loss)
  14. CAP
     ● Consistent and Available
       – ACID systems, MySQL Cluster, Oracle Coherence, Drizzle
     ● Consistent and Partition Tolerant
       – SCLA (strongly consistent, loosely available)
       – HBase, Bigtable
     ● Available and Partition Tolerant
       – BASE systems (CouchDB, SimpleDB, MongoDB)
     ● Cassandra (sits between SCLA and BASE systems)
  15. Projects
  16. Hadoop
     ● Open-source software for reliable, scalable, distributed computing (Hadoop website)
       – Hadoop Common
       – HDFS
       – MapReduce
     ● Initially created in early 2006 to support the Nutch search engine project
     ● Inspired by the Google File System paper (Oct 2003) and the MapReduce paper (2004)
  17. Hadoop Related Projects
     ● HBase
       – A scalable, distributed database that supports structured data storage for large tables
     ● Hive
       – A data warehouse infrastructure that provides data summarization and ad hoc querying
     ● Pig
       – A high-level data-flow language and execution framework for parallel computation
     ● Cassandra
       – Uses Hadoop for MapReduce
  18. Who Uses Hadoop
     ● eBay (532 nodes, search optimization)
     ● Facebook (1100x8 node cluster, 300x8 node cluster, more on this later)
     ● GumGum (Ken Weiner, 20+ node cluster on Amazon EC2)
     ● Hulu (log storage analysis)
     ● Last.fm (44x2 nodes log analysis, 20x2 nodes profile analysis)
     ● LinkedIn (120x2x4 nodes, 520x2x4 nodes, "People you may know")
     ● Twitter (more on this later)
     ● Yahoo! (100,000 CPUs running Hadoop, more on this later)
  19. CouchDB
     ● Apache open-source, document-oriented database written in Erlang (a concurrent programming language)
     ● Designed to scale horizontally
     ● Stores documents (one or more field/value pairs expressed as JSON)
     ● ACID semantics
     ● Map/Reduce views and indexes (written in server-side JavaScript)
     ● Bidirectional replication (with conflict resolution)
     ● REST API
  20. http://couchdb.apache.org/img/sketch.png
  21. CouchDB Sample Document
       "Subject": "I like Plankton"
       "Author": "Rusty"
       "PostedDate": "5/23/2006"
       "Tags": ["plankton", "baseball", "decisions"]
       "Body": "I decided today that I don't like baseball. I like plankton."
     http://couchdb.apache.org/docs/intro.html
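Since CouchDB stores documents as JSON, the sample document can be expressed as a Python dictionary and round-tripped through the standard `json` module. This is a sketch of the document shape only; it does not talk to a CouchDB server, and the variable names are illustrative.

```python
import json

# The sample document from the slide, as field/value pairs.
document = {
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "5/23/2006",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton.",
}

# Serialize to the JSON text a document store would hold, then parse it back.
encoded = json.dumps(document, indent=2)
decoded = json.loads(encoded)
```

Note that a field such as `Tags` holds a list, which is one reason a document model can avoid the join table a relational schema would need for the same data.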
  22. Who uses CouchDB?
     ● Ubuntu One (cloud storage service)
       – http://ubuntuone.com/
     ● "I Play WoW" Facebook app
       – http://blog.socklabs.com/2008/12/24/iplaywow_monthly_actives.html
     ● Wego (travel site)
       – http://www.wego.com/
  23. Cassandra
     ● Fault tolerant (replication; failed nodes can be replaced with no downtime)
     ● Decentralized (every node in the cluster is identical, no bottlenecks)
     ● Supports either synchronous or asynchronous update replication
     ● Supports more than simple key/value pairs
     ● Elastic (read/write throughput increases linearly as machines are added)
     ● Durable (suitable for applications that can't afford to lose data)
  24. Cassandra
     ● Initially developed by Facebook for Inbox Search (until replaced by HBase)
     ● Key-value store where each value can itself hold multiple values
     ● Some inspiration from Amazon's Dynamo (another key-value store)
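One way to picture "a key-value store where each value can itself hold multiple values" is a map from a row key to a set of named columns. The sketch below is an in-memory illustration only: the class `ColumnStore` and its methods are hypothetical and are not Cassandra's API.

```python
from collections import defaultdict
from typing import Optional


class ColumnStore:
    """Toy model: each row key maps to a dictionary of column -> value."""

    def __init__(self) -> None:
        self._rows: dict[str, dict[str, str]] = defaultdict(dict)

    def put(self, row_key: str, column: str, value: str) -> None:
        # Rows do not need to share the same set of columns.
        self._rows[row_key][column] = value

    def get_row(self, row_key: str) -> dict[str, str]:
        # Return a copy so callers cannot mutate stored data by accident.
        return dict(self._rows.get(row_key, {}))

    def get(self, row_key: str, column: str) -> Optional[str]:
        return self._rows.get(row_key, {}).get(column)


store = ColumnStore()
store.put("user:rusty", "name", "Rusty")
store.put("user:rusty", "likes", "plankton")
store.put("user:ann", "name", "Ann")
```

A lookup by row key returns all of that row's columns at once, while a lookup by row key and column returns a single value.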
  25. Who uses Cassandra?
     ● Facebook (previously)
     ● Twitter
     ● Digg
     ● Cisco
  26. MongoDB
     ● Name is derived from "humongous"
     ● Document-oriented database written in C++
     ● Manages collections of JSON-like documents
     ● Binaries available for Windows, Linux, OS X, Solaris
     ● Supports dates, regular expressions, code, binary data (all BSON types)
     ● Cursors for query results
     ● Any field can be queried at any time
  27. MongoDB
     ● Queries can include user-defined JavaScript functions
     ● Master/Slave (only the master supports writes; slaves can be read from)
     ● Scales horizontally using sharding
     ● Support for Map/Reduce
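Sharding, mentioned here and in the Definitions slide, spreads records across several stores based on a key. The sketch below shows a generic hash-based scheme; it is an assumption chosen for brevity, not a description of how MongoDB itself assigns documents to shards, and `shard_for` and `ShardedStore` are hypothetical names.

```python
import hashlib
from typing import Any, Optional


def shard_for(key: str, shard_count: int) -> int:
    """Pick a shard index from a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Toy store: each shard is a plain dictionary standing in for a server."""

    def __init__(self, shard_count: int) -> None:
        self.shards: list[dict[str, Any]] = [{} for _ in range(shard_count)]

    def put(self, key: str, value: Any) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str) -> Optional[Any]:
        return self.shards[shard_for(key, len(self.shards))].get(key)


sharded = ShardedStore(shard_count=4)
for user_id in range(100):
    sharded.put(f"user:{user_id}", {"id": user_id})
```

Because the shard index depends on the shard count, changing the number of shards in this simple scheme moves most keys, which is the re-sharding cost the Netflix slide later says Cassandra avoids.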
  28. Who uses MongoDB?
     ● New York Times
     ● Shutterfly
     ● Foursquare
     ● SourceForge
     ● Intuit
  29. Google Bigtable
     ● Built on GFS (Google File System)
     ● Can be used with Google App Engine
     ● Maps two arbitrary strings (row key and column key) and a timestamp to a stored value
     ● Designed to scale into the petabyte range
     ● Designed to scale across hundreds or thousands of machines
     ● Portions of a table (tablets) can be compressed
     ● HBase was modeled after Bigtable
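The "two strings and a timestamp" mapping can be sketched as a dictionary keyed by row and column, where each cell keeps several timestamped versions. This is an in-memory illustration with hypothetical names (`VersionedTable`, `put`, `get`); it is not Bigtable's or HBase's client API.

```python
from typing import Optional


class VersionedTable:
    """Toy model of a (row, column, timestamp) -> value map."""

    def __init__(self) -> None:
        # (row, column) -> {timestamp: value}
        self._cells: dict[tuple[str, str], dict[int, bytes]] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row: str, column: str,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        """Return the newest version, or the newest at or before `timestamp`."""
        versions = self._cells.get((row, column))
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]
        eligible = [t for t in versions if t <= timestamp]
        return versions[max(eligible)] if eligible else None


table = VersionedTable()
table.put("com.example/index", "contents", 1, b"<html>v1</html>")
table.put("com.example/index", "contents", 5, b"<html>v2</html>")
```

Reading without a timestamp returns the latest version, while passing a timestamp returns the cell as it looked at that time.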
  30. Who uses Bigtable?
     ● Google Reader
     ● Google Maps
     ● Google Book Search
     ● Google Earth
     ● Blogger.com
     ● Google Code
     ● Orkut
     ● YouTube
     ● Gmail
  31. Amazon SimpleDB
     ● Written in Erlang
     ● Used with Amazon EC2 and Amazon S3
     ● Easy access to lookup and query functions
     ● No support for the less frequently used, complex database functions
     ● No need to pre-define the formats of the data that will be stored
     ● Scalable (with size limitations)
       – 10 GB per domain, up to 250 domains
     ● Fast/Reliable
     ● Supports both eventually consistent reads and consistent reads
     ● Potentially inexpensive
  32. SimpleDB Data Model
     http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/DataModel.html
  33. SimpleDB Data Model
     ● Customer Account (Amazon Web Services account)
     ● Domains (similar to tables, or spreadsheet tabs)
     ● Items (similar to rows)
     ● Attributes (similar to columns)
     ● Values (similar to cells)
       – Unlike a spreadsheet, however, multiple values can be associated with a cell
     ● One domain can contain different types of data (some attributes not filled in)
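The domain, item, attribute, and value hierarchy can be sketched with nested dictionaries and sets. The class below is a local, in-memory illustration: its method names echo the operation names on the API summary slide, but it is hypothetical and is not the AWS SDK.

```python
from collections import defaultdict


class Domain:
    """Toy SimpleDB-style domain: item -> attribute -> set of values."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._items: dict[str, dict[str, set[str]]] = defaultdict(
            lambda: defaultdict(set)
        )

    def put_attributes(self, item_name: str,
                       attributes: list[tuple[str, str]]) -> None:
        # An attribute may be given several times to attach several values.
        for attribute, value in attributes:
            self._items[item_name][attribute].add(value)

    def get_attributes(self, item_name: str) -> dict[str, list[str]]:
        item = self._items.get(item_name, {})
        return {attribute: sorted(values) for attribute, values in item.items()}


products = Domain("products")
products.put_attributes("item1", [("color", "blue"), ("color", "red"),
                                  ("size", "large")])
products.put_attributes("item2", [("author", "Rusty")])
```

In this example `item1` has two values in its `color` "cell", and `item2` uses an attribute that `item1` never fills in, matching the last two bullets of the slide.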
  34. SimpleDB API Summary
     ● CreateDomain
     ● DeleteDomain
     ● ListDomains
     ● PutAttributes
     ● BatchPutAttributes
     ● DeleteAttributes
     ● BatchDeleteAttributes
     ● GetAttributes
     ● Select
     ● DomainMetadata
  35. Who uses SimpleDB?
     ● Netflix
     ● Other Amazon EC2 customers...
  36. memcached
     ● General-purpose distributed memory caching system
     ● Often used to cache data in RAM that would otherwise be fetched from an external data source
     ● LRU eviction (the least recently used entries are discarded when the cache is full)
     ● Can be distributed across multiple machines
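The LRU behaviour noted on this slide can be sketched with the standard library's `OrderedDict`. This is a single-process illustration of the eviction policy only; `LRUCache` is a hypothetical class, not memcached's implementation or client API.

```python
from collections import OrderedDict
from typing import Any, Hashable, Optional


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._data: OrderedDict[Hashable, Any] = OrderedDict()

    def get(self, key: Hashable, default: Optional[Any] = None) -> Any:
        if key not in self._data:
            return default
        # A read counts as a use, so move the key to the "recent" end.
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, key: Hashable, value: Any) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            # Drop the entry at the "old" end of the ordering.
            self._data.popitem(last=False)


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # "a" is now the most recently used key
cache.set("c", 3)   # cache is full, so "b" is evicted
```

A typical caller checks the cache first and falls back to the slower external data source only on a miss, storing the result for next time.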
  37. Who uses memcached?
     ● YouTube
     ● Zynga
     ● Facebook
     ● Twitter
  38. Terracotta
     ● JVM in-memory distributed cache / store
     ● The object store can be persistent
     ● Distribution between nodes is handled through the Terracotta server
     ● Supports multiple Terracotta servers
     ● Nodes only receive data they need/reference
  39. Who uses Terracotta?
     ● Sakai (thanks to John Wiley & Sons)
     ● PartyGaming (PartyPoker.com)
     ● Adobe
     ● Pearson
  40. Example Case Studies
  41. Yahoo!
     ● Hadoop
       – http://developer.yahoo.com/blogs/hadoop
       – More than 100,000 CPUs in >36,000 computers running Hadoop
       – Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM)
       – Used to support research for Ad Systems and Web Search
       – Also used to do scaling tests to support development of Hadoop on larger clusters
       – >60% of Hadoop jobs within Yahoo are Pig jobs
  42. Twitter
     ● How Twitter Uses NoSQL
       – http://goo.gl/Bwxoe
     ● Scribe
       – Syslog stopped scaling
     ● Hadoop
       – Needs to store more data per day than it can reliably write to a single hard drive
     ● Pig
       – Used for interacting with Hadoop
     ● HBase
       – People Search
     ● FlockDB
       – Social Graph Analysis
  43. Netflix
     ● NoSQL at Netflix
       – http://goo.gl/SDcsZ
     ● SimpleDB
       – Highly durable, with writes automatically replicated across availability zones within a region
       – Love it when others do heavy lifting for us
     ● Hadoop/HBase
       – Convenient, high-performance column-oriented distributed database solution
       – HBase makes it really easy to grow your cluster and re-distribute load across nodes at runtime
     ● Cassandra
       – Adding more servers, without the need to re-shard
  44. Facebook
     ● http://goo.gl/J9EVW
     ● 350 million users sending over 15 billion person-to-person messages per month
     ● Chat service supports over 300 million users who send over 120 billion messages per month
     ● Two patterns emerged
       – A short set of temporal data that tends to be volatile
       – An ever-growing set of data that rarely gets accessed
     ● Evaluated clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems
       – MySQL proved to not handle the long tail of data well (as indexes and data grow large, performance suffers)
       – Cassandra's eventual consistency model proved to be a difficult pattern to reconcile for our new Messages infrastructure
  45. “There is a learning curve and an operational overhead. Still, the scalability, availability and performance advantages of the NoSQL persistence model are evident and are paying for themselves already, and will be central to our long-term cloud strategy.”
     Yury Izrailevsky, Netflix
  46. Questions & Answers
     Cris J. Holdorph, Software Architect, Unicon, Inc.
     Twitter: @holdorph
     holdorph@unicon.net
     www.unicon.net
