No SQL Technologies

Introduction to NoSQL Technologies

No SQL Technologies Presentation Transcript

  • 1. What Should I Know about NoSQL?
    Cris J. Holdorph, Software Architect, Unicon, Inc.
    Jasig Conference, Westminster, CO, May 24, 2011
    © Copyright Unicon, Inc., 2008. Some rights reserved. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/
  • 2. Lethal SQL
  • 3. 3
  • 4. Agenda
    1. Definitions
    2. History
    3. Projects
    4. Example Case Studies
  • 5. Definitions
  • 6. Definitions
    ● RDBMS
    ● SQL
    ● CRUD
    ● ACID – Atomicity, Consistency, Isolation, Durability
    ● BASE – Basically Available, Soft state, Eventual consistency
  • 7. 7
  • 8. Definitions
    ● Big Data
    ● Sharding
    ● Cloud Computing
    ● Distributed File System
    ● Key Value Store
  • 9. History
  • 10. MapReduce
    ● Patented software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.
    ● Naming originally inspired by the map and reduce functions of functional programming (but their purpose is not the same as it was there)
    ● Map – The master node takes the input, partitions it into smaller sub-problems, and distributes those to worker nodes
    ● Reduce – The master node then takes the answers to all the sub-problems and combines them in some way to get the output
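The map and reduce steps described on this slide can be sketched on a single machine. The following is a minimal illustration only, not Hadoop's or Google's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example, and a real framework would run each map call on a separate worker node.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document stands in for one sub-problem handed to a worker node.
    emitted = []
    for doc in documents:
        emitted.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


counts = word_count(["no sql", "not only sql", "not sql"])
```

Here `counts` holds one total per distinct word, built only from per-split results, which is why the map step can be distributed freely.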
  • 11. What does NoSQL Stand For?
    ● NoSQL
    ● No SQL
    ● Not SQL
    ● Not Only SQL
    ● Not the RDBMS
    ● Wikipedia:
      – Carlo Strozzi used the term "NoSQL" in 1998 to name his lightweight, open-source relational database that did not expose an SQL interface.
  • 12. History
    ● Some techniques have existed for over 25 years
    ● Teradata has been selling a product for more than 20 years
    ● RDBMS dates back to 1970
  • 13. CAP Theorem
    ● A conjecture made by Eric Brewer at the Symposium on Principles of Distributed Computing (2000)
    ● States that it is only possible to achieve 2 of 3
      – Consistency (all nodes see the same data at the same time)
      – Availability (node failures do not prevent survivors from continuing to operate)
      – Partition Tolerance (the system continues to operate despite arbitrary message loss)
  • 14. CAP
    ● Consistent and Available
      – ACID systems, MySQL cluster, Oracle Coherence, Drizzle
    ● Consistent and Partition Tolerant
      – SCLA (strongly consistent, loosely available)
      – HBase, Bigtable
    ● Available and Partition Tolerant
      – BASE systems (CouchDB, SimpleDB, MongoDB)
    ● Cassandra (sits between SCLA/BASE systems)
  • 15. Projects
  • 16. Hadoop
    ● Open-source software for reliable, scalable, distributed computing (Hadoop website)
      – Hadoop Common
      – HDFS
      – MapReduce
    ● Initially created in early 2006 to support the search engine project Nutch
    ● Inspired by the Google File System and MapReduce papers (Oct 2003)
  • 17. Hadoop Related Projects
    ● HBase – A scalable, distributed database that supports structured data storage for large tables
    ● Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying
    ● Pig – A high-level data-flow language and execution framework for parallel computation
    ● Cassandra – uses Hadoop for MapReduce
  • 18. Who Uses Hadoop
    ● eBay (532 nodes, search optimization)
    ● Facebook (1100x8 node cluster, 300x8 node cluster, more on this later)
    ● GumGum (Ken Weiner, 20+ node cluster on Amazon EC2)
    ● Hulu (log storage analysis)
    ● Last.fm (44x2 nodes log analysis, 20x2 nodes profile analysis)
    ● LinkedIn (120x2x4 nodes, 520x2x4 nodes, "People you may know")
    ● Twitter (more on this later)
    ● Yahoo! (100,000 CPUs running Hadoop, more on this later)
  • 19. CouchDB
    ● Apache open source document-oriented database written in Erlang (a concurrent programming language)
    ● Designed to scale horizontally
    ● Stores documents (one or more field/value pairs expressed as JSON)
    ● ACID semantics
    ● Map/Reduce views and indexes (written in server-side JavaScript)
    ● Bidirectional replication (with conflict resolution)
    ● REST API
  • 20. http://couchdb.apache.org/img/sketch.png
  • 21. CouchDB Sample Document
    "Subject": "I like Plankton"
    "Author": "Rusty"
    "PostedDate": "5/23/2006"
    "Tags": ["plankton", "baseball", "decisions"]
    "Body": "I decided today that I don't like baseball. I like plankton."
    Source: http://couchdb.apache.org/docs/intro.html
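To make the sample document concrete, the sketch below writes it as a Python dictionary, round-trips it through JSON (the format CouchDB documents are expressed in), and applies a map-style function of the kind a view might use. This is an illustration under assumptions: CouchDB views are written in JavaScript, so `map_by_tag` is a hypothetical Python stand-in, and no CouchDB server or client library is involved.

```python
import json

# The sample document from the slide, written as a JSON-compatible dict.
post = {
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "5/23/2006",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton.",
}


def map_by_tag(doc):
    """Emit one (tag, subject) row per tag, in the spirit of a view's map step."""
    for tag in doc.get("Tags", []):
        yield tag, doc["Subject"]


encoded = json.dumps(post)      # the text form a document body would take
decoded = json.loads(encoded)   # parsing it back yields the same fields
rows = sorted(map_by_tag(decoded))
```

Sorting the emitted rows by key mimics how a view index lets documents be looked up by tag rather than by document ID.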
  • 22. Who uses CouchDB?
    ● Ubuntu One – cloud storage service
      – http://ubuntuone.com/
    ● "I Play WoW" Facebook app
      – http://blog.socklabs.com/2008/12/24/iplaywow_monthly_actives.html
    ● Wego – travel site
      – http://www.wego.com/
  • 23. Cassandra
    ● Fault tolerant (replication, failed nodes can be replaced with no downtime)
    ● Decentralized (every node in the cluster is identical, no bottlenecks)
    ● Supports either synchronous or asynchronous update replication
    ● Supports more than simple key/value pairs
    ● Elastic (read/write throughput increases linearly as machines are added)
    ● Durable (suitable for applications that can't afford to lose data)
  • 24. Cassandra
    ● Initially developed by Facebook for Inbox Search (until replaced by HBase)
    ● Key-value store where a value can hold multiple values
    ● Some inspiration from Amazon's Dynamo (another key-value store)
  • 25. Who uses Cassandra?
    ● Facebook (previously)
    ● Twitter
    ● Digg
    ● Cisco
  • 26. MongoDB
    ● Name is derived from "humongous"
    ● Document-oriented database written in C++
    ● Manages collections of JSON-like documents
    ● Binaries available for Windows, Linux, OS X, Solaris
    ● Supports dates, regular expressions, code, binary data (all BSON types)
    ● Cursors for query results
    ● Any field can be queried at any time
  • 27. MongoDB
    ● Queries can include user-defined JavaScript functions
    ● Master/Slave (only master supports writes, slaves can be read from)
    ● Scales horizontally using sharding
    ● Support for Map/Reduce
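Sharding, listed here and in the definitions, means spreading records across several servers by key. The sketch below is a generic hash-based illustration and does not show MongoDB's own partitioning scheme, which the slides do not describe; the shard names and the `shard_for` and `partition` helpers are hypothetical.

```python
import hashlib

SHARDS = ["shard-a", "shard-b", "shard-c"]  # hypothetical shard names


def shard_for(key, shards=SHARDS):
    """Pick a shard deterministically from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def partition(keys, shards=SHARDS):
    """Assign every key to exactly one shard."""
    placement = {name: [] for name in shards}
    for key in keys:
        placement[shard_for(key, shards)].append(key)
    return placement


placement = partition(["user:%d" % i for i in range(100)])
```

Because the hash is stable, any client can compute where a key lives without asking a coordinator. The trade-off of this naive modulo scheme is that changing the number of shards moves most keys.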
  • 28. Who uses MongoDB?
    ● New York Times
    ● Shutterfly
    ● Foursquare
    ● SourceForge
    ● Intuit
  • 29. Google Bigtable
    ● Built on GFS (Google File System)
    ● Can be used with Google App Engine
    ● Maps two arbitrary strings (a row key and a column key) and a timestamp to a value
    ● Designed to scale into the petabyte range
    ● Designed to scale across hundreds or thousands of machines
    ● Portions of a table (tablets) can be compressed
    ● HBase was modeled after Bigtable
  • 30. Who uses Bigtable?
    ● Google Reader
    ● Google Maps
    ● Google Book Search
    ● Google Earth
    ● Blogger.com
    ● Google Code
    ● Orkut
    ● YouTube
    ● Gmail
  • 31. Amazon SimpleDB
    ● Written in Erlang
    ● Used with Amazon EC2 and Amazon S3
    ● Easy access to lookup and query functions
    ● Without support for the less-used complex database functions
    ● No need to pre-define the data formats that will be stored
    ● Scalable (with size limitations)
      – 10 GB per domain, up to 250 domains
    ● Fast/Reliable
    ● Supports eventually consistent reads and consistent reads
    ● Potentially inexpensive
  • 32. SimpleDB Data Model
    http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/DataModel.html
  • 33. SimpleDB Data Model
    ● Customer Account (Amazon Web Services account)
    ● Domains (similar to tables, or spreadsheet tabs)
    ● Items (similar to rows)
    ● Attributes (similar to columns)
    ● Values (similar to cells)
      – Unlike a spreadsheet, however, multiple values can be associated with a cell
    ● One domain can contain different types of data (some attributes not filled in)
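The domain/item/attribute/value hierarchy above can be modeled in a few lines. This is a toy in-memory sketch, not the AWS API: the `Domain` class and its `put_attributes` and `get_attributes` methods are hypothetical names chosen for this illustration, and the item and attribute names are invented sample data.

```python
class Domain:
    """Toy in-memory stand-in for a SimpleDB domain (not the AWS API)."""

    def __init__(self, name):
        self.name = name
        self.items = {}  # item name -> {attribute name -> set of values}

    def put_attributes(self, item_name, attributes):
        """Add (attribute, value) pairs; an attribute may hold many values."""
        item = self.items.setdefault(item_name, {})
        for attribute, value in attributes:
            item.setdefault(attribute, set()).add(value)

    def get_attributes(self, item_name):
        """Return the item's attributes, each with a sorted list of values."""
        item = self.items.get(item_name, {})
        return {attribute: sorted(values) for attribute, values in item.items()}


store = Domain("store")
store.put_attributes("item1", [("Color", "Blue"), ("Color", "Red"), ("Size", "Small")])
store.put_attributes("item2", [("Make", "Audi")])  # different attributes, same domain
item1 = store.get_attributes("item1")
```

The two items share no attributes, which reflects the point that one domain can hold different types of data, and `Color` holding two values shows the multi-valued "cell".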
  • 34. SimpleDB API Summary
    ● CreateDomain
    ● DeleteDomain
    ● ListDomains
    ● PutAttributes
    ● BatchPutAttributes
    ● DeleteAttributes
    ● BatchDeleteAttributes
    ● GetAttributes
    ● Select
    ● DomainMetadata
  • 35. Who uses SimpleDB?
    ● Netflix
    ● Other Amazon EC2 customers...
  • 36. memcached
    ● General-purpose distributed memory caching system
    ● Often used to cache in RAM data that might otherwise be obtained from an external data source
    ● LRU eviction (when the cache is full)
    ● Can be distributed across multiple machines
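The LRU (least recently used) policy mentioned on this slide can be sketched with the standard library. This is a single-process illustration of the eviction idea only, not memcached's implementation or client API; the `LRUCache` class and its `get` and `set` methods are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.entries:
            return None  # cache miss: the caller would hit the data source
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def set(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        self.entries[key] = value
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # drop the least recently used


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # touching "a" makes "b" the oldest entry
cache.set("c", 3)   # the cache is full, so "b" is evicted
```

A miss returns `None` here; in a typical caching setup the application would then fetch the value from the external data source and store it with `set`.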
  • 37. Who uses memcached?
    ● YouTube
    ● Zynga
    ● Facebook
    ● Twitter
  • 38. Terracotta
    ● JVM in-memory distributed cache / store
    ● The object store can be persistent
    ● Distribution between nodes is handled through the Terracotta server
    ● Supports multiple Terracotta servers
    ● Nodes only receive data they need/reference
  • 39. Who uses Terracotta?
    ● Sakai (thanks to John Wiley & Sons)
    ● PartyGaming (PartyPoker.com)
    ● Adobe
    ● Pearson
  • 40. Example Case Studies
  • 41. Yahoo!
    ● Hadoop
      – http://developer.yahoo.com/blogs/hadoop
      – More than 100,000 CPUs in >36,000 computers running Hadoop
      – Our biggest cluster: 4000 nodes (2*4cpu boxes with 4*1TB disk & 16GB RAM)
      – Used to support research for Ad Systems and Web Search
      – Also used to do scaling tests to support development of Hadoop on larger clusters
      – >60% of Hadoop jobs within Yahoo are Pig jobs
  • 42. Twitter
    ● How Twitter Uses NoSQL
      – http://goo.gl/Bwxoe
    ● Scribe
      – Syslog stopped scaling
    ● Hadoop
      – Needs to store more data per day than it can reliably write to a single hard drive
    ● Pig
      – Used for interacting with Hadoop
    ● HBase
      – People Search
    ● FlockDB
      – Social Graph Analysis
  • 43. Netflix
    ● NoSQL at Netflix
      – http://goo.gl/SDcsZ
    ● SimpleDB
      – Highly durable, with writes automatically replicated across availability zones within a region
      – Love it when others do heavy lifting for us
    ● Hadoop/HBase
      – Convenient, high-performance column-oriented distributed database solution
      – HBase makes it really easy to grow your cluster and re-distribute load across nodes at runtime
    ● Cassandra
      – Adding more servers, without the need to re-shard
  • 44. Facebook
    ● http://goo.gl/J9EVW
    ● 350 million users sending over 15 billion person-to-person messages per month
    ● Chat service supports over 300 million users who send over 120 billion messages per month
    ● Two patterns emerged
      – A short set of temporal data that tends to be volatile
      – An ever-growing set of data that rarely gets accessed
    ● Evaluated clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems
      – MySQL proved to not handle the long tail of data well (as indexes/data grow large, performance suffers)
      – Found Cassandra's eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure
  • 45. “There is a learning curve and an operational overhead. Still, the scalability, availability and performance advantages of the NoSQL persistence model are evident and are paying for themselves already, and will be central to our long-term cloud strategy.” Yury Izrailevsky, Netflix
  • 46. Questions & Answers
    Cris J. Holdorph, Software Architect, Unicon, Inc.
    Twitter: @holdorph
    holdorph@unicon.net
    www.unicon.net