No SQL Technologies
Introduction to NoSQL Technologies

No SQL Technologies Presentation Transcript

  • What Should I Know about NoSQL?
    Cris J. Holdorph, Software Architect, Unicon, Inc.
    Jasig Conference, Westminster, CO, May 24, 2011
    © Copyright Unicon, Inc., 2008. Some rights reserved. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/
  • Lethal SQL
  • Agenda
    1. Definitions
    2. History
    3. Projects
    4. Example Case Studies
  • Definitions
  • Definitions
    ● RDBMS
    ● SQL
    ● CRUD
    ● ACID – Atomicity, Consistency, Isolation, Durability
    ● BASE – Basically Available, Soft state, Eventual consistency
  • Definitions
    ● Big Data
    ● Sharding
    ● Cloud Computing
    ● Distributed File System
    ● Key Value Store
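Two of the terms above, sharding and key-value store, can be illustrated together. The following is a minimal in-memory sketch, not how any particular product implements them: the names `shard_for` and `ShardedKeyValueStore` are hypothetical, and the plain hash-modulo placement is an assumption chosen for brevity (real systems often use consistent hashing or range partitioning instead).

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard index from a stable hash of the key.

    A cryptographic digest is used only because it is stable across runs,
    unlike Python's built-in hash() for strings.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


class ShardedKeyValueStore:
    """Hypothetical key-value store whose data is split across several shards."""

    def __init__(self, num_shards: int = 4):
        # Each dict stands in for a separate machine or database instance.
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str, default=None):
        return self.shards[shard_for(key, len(self.shards))].get(key, default)


store = ShardedKeyValueStore(num_shards=4)
store.put("user:42", {"name": "Rusty"})
print(store.get("user:42"))
```

Each key is routed to exactly one shard, so reads and writes for different keys can be served by different machines.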
  • History
  • MapReduce
    ● Patented software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.
    ● Naming originally inspired by the map and reduce functions of functional programming (but their purpose in the framework is not the same as it was there)
    ● Map – The master node takes the input, partitions it into smaller sub-problems, and distributes those to worker nodes
    ● Reduce – The master node then takes the answers to all the sub-problems and combines them in some way to get the output
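The map and reduce steps described above can be sketched on a single machine with a word count, the usual introductory example. This is only an illustration of the programming model, not Google's or Hadoop's implementation: the function names are hypothetical, and the "workers" are simulated by a plain loop rather than separate nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = []
    for doc in documents:  # each document stands in for a split sent to a worker
        pairs.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["I like plankton", "I like baseball"])
print(counts)
```

Because each map call only sees its own split and each reduce call only sees one key, both phases can be spread across many machines without coordination between them.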
  • What does NoSQL Stand For?
    ● NoSQL
    ● No SQL
    ● Not SQL
    ● Not Only SQL
    ● Not the RDBMS
    ● Wikipedia:
      – Carlo Strozzi used the term "NoSQL" in 1998 to name his lightweight, open-source relational database that did not expose an SQL interface.
  • History
    ● Some techniques have existed for over 25 years
    ● Teradata has been selling its product for more than 20 years
    ● RDBMS dates back to 1970
  • CAP Theorem
    ● A conjecture made by Eric Brewer at the Symposium on Principles of Distributed Computing (2000)
    ● States that a distributed system can achieve only 2 of the following 3 properties
      – Consistency (all nodes see the same data at the same time)
      – Availability (node failures do not prevent survivors from continuing to operate)
      – Partition Tolerance (the system continues to operate despite arbitrary message loss)
  • CAP
    ● Consistent and Available
      – ACID systems, MySQL cluster, Oracle Coherence, Drizzle
    ● Consistent and Partition Tolerant
      – SCLA (strongly consistent, loosely available)
      – HBase, Bigtable
    ● Available and Partition Tolerant
      – BASE systems (CouchDB, SimpleDB, MongoDB)
    ● Cassandra (sits between SCLA/BASE systems)
  • Projects
  • Hadoop
    ● Open-source software for reliable, scalable, distributed computing (Hadoop website)
      – Hadoop Common
      – HDFS
      – MapReduce
    ● Initially created in early 2006 to support the search engine project Nutch
    ● Inspired by the Google File System and MapReduce papers (Oct 2003)
  • Hadoop Related Projects
    ● HBase – A scalable, distributed database that supports structured data storage for large tables
    ● Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying
    ● Pig – A high-level data-flow language and execution framework for parallel computation
    ● Cassandra – uses Hadoop for MapReduce
  • Who Uses Hadoop
    ● eBay (532 nodes, Search optimization)
    ● Facebook (1100x8 node cluster, 300x8 node cluster, more on this later)
    ● GumGum (Ken Weiner, 20+ node cluster on Amazon EC2)
    ● Hulu (log storage analysis)
    ● Last.fm (44x2 nodes log analysis, 20x2 nodes profile analysis)
    ● LinkedIn (120x2x4 nodes, 520x2x4 nodes, "People you may know")
    ● Twitter (more on this later)
    ● Yahoo! (100,000 CPUs running Hadoop, more on this later)
  • CouchDB
    ● Apache open source document-oriented database written in Erlang (a concurrent programming language)
    ● Designed to scale horizontally
    ● Stores documents (one or more field/value pairs expressed as JSON)
    ● ACID semantics
    ● Map/Reduce views and indexes (written in server-side JavaScript)
    ● Bi-directional replication (with conflict resolution)
    ● REST API
  • http://couchdb.apache.org/img/sketch.png
  • CouchDB Sample Document
    "Subject": "I like Plankton"
    "Author": "Rusty"
    "PostedDate": "5/23/2006"
    "Tags": ["plankton", "baseball", "decisions"]
    "Body": "I decided today that I don't like baseball. I like plankton."
    http://couchdb.apache.org/docs/intro.html
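The sample document above is a set of field/value pairs that serializes naturally as JSON. The sketch below writes it as a Python dictionary and runs a tiny map/reduce-style tag count over it. This only illustrates the idea: CouchDB views are written in server-side JavaScript, and the function names `map_tags` and `reduce_counts` are hypothetical, not part of any CouchDB API.

```python
import json
from collections import Counter

# The sample document from the slide, expressed as a Python dict.
doc = {
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "5/23/2006",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton.",
}


def map_tags(document):
    """Map step: emit (tag, 1) for each tag; documents without tags emit nothing."""
    for tag in document.get("Tags", []):
        yield tag, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


serialized = json.dumps(doc)  # the JSON text a document store would hold
tag_counts = reduce_counts(map_tags(json.loads(serialized)))
print(tag_counts)
```

Because the map step tolerates missing fields, documents with different shapes can live in the same database without a predefined schema.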
  • Who uses CouchDB?
    ● Ubuntu One – cloud storage service
      – http://ubuntuone.com/
    ● "I Play WoW" Facebook app
      – http://blog.socklabs.com/2008/12/24/iplaywow_monthly_actives.html
    ● Wego – travel site
      – http://www.wego.com/
  • Cassandra
    ● Fault tolerant (replication, failed nodes can be replaced with no downtime)
    ● Decentralized (every node in the cluster is identical, no bottlenecks)
    ● Supports either synchronous or asynchronous update replication
    ● Supports more than simple key/value pairs
    ● Elastic (read/write throughput increases linearly as machines are added)
    ● Durable (suitable for applications that can't afford to lose data)
  • Cassandra
    ● Initially developed by Facebook for Inbox Search (until replaced by HBase)
    ● Key-value store where a value can itself hold multiple values
    ● Some inspiration from Amazon's Dynamo (another key-value store)
  • Who uses Cassandra?
    ● Facebook (previously)
    ● Twitter
    ● Digg
    ● Cisco
  • MongoDB
    ● Name is derived from "humongous"
    ● Document-oriented database written in C++
    ● Manages collections of JSON-like documents
    ● Binaries available for Windows, Linux, OS X, Solaris
    ● Supports dates, regular expressions, code, binary data (all BSON types)
    ● Cursors for query results
    ● Any field can be queried at any time
  • MongoDB
    ● Queries can include user-defined JavaScript functions
    ● Master/Slave (only the master supports writes, slaves can be read from)
    ● Scales horizontally using sharding
    ● Support for Map/Reduce
  • Who uses MongoDB?
    ● New York Times
    ● Shutterfly
    ● Foursquare
    ● SourceForge
    ● Intuit
  • Google Bigtable
    ● Built on GFS (Google File System)
    ● Can be used with Google App Engine
    ● Maps two arbitrary strings (a row key and a column key) and a timestamp to a value
    ● Designed to scale into the petabyte range
    ● Designed to scale across hundreds or thousands of machines
    ● Portions of a table (tablets) can be compressed
    ● HBase was modeled after Bigtable
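The "two strings and a timestamp" mapping can be pictured as a dictionary keyed by (row, column) whose cells keep several timestamped versions. The sketch below is a toy in-memory model under that assumption: the class name `TinyTable` and its methods are hypothetical and say nothing about how Bigtable or HBase store data on disk.

```python
class TinyTable:
    """Toy model of a (row key, column key, timestamp) -> value map."""

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version, or the newest one at or before `timestamp`."""
        versions = self._cells.get((row, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None


table = TinyTable()
table.put("com.example.www", "contents:", 1, "<html>v1</html>")
table.put("com.example.www", "contents:", 2, "<html>v2</html>")
print(table.get("com.example.www", "contents:"))               # newest version
print(table.get("com.example.www", "contents:", timestamp=1))  # version as of t=1
```

Keeping old versions under their timestamps is what lets a reader ask for the state of a cell at an earlier point in time.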
  • Who uses Bigtable?
    ● Google Reader
    ● Google Maps
    ● Google Book Search
    ● Google Earth
    ● Blogger.com
    ● Google Code
    ● Orkut
    ● YouTube
    ● Gmail
  • Amazon SimpleDB
    ● Written in Erlang
    ● Used with Amazon EC2 and Amazon S3
    ● Easy access to lookup and query functions
    ● No support for the less-used, complex database functions
    ● No need to pre-define the data formats that will be stored
    ● Scalable (with size limitations)
      – 10 GB per domain, up to 250 domains
    ● Fast/Reliable
    ● Supports both eventually consistent reads and consistent reads
    ● Potentially inexpensive
  • SimpleDB Data Model
    http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/DataModel.html
  • SimpleDB Data Model
    ● Customer Account (Amazon Web Services account)
    ● Domains (similar to tables, or spreadsheet tabs)
    ● Items (similar to rows)
    ● Attributes (similar to columns)
    ● Values (similar to cells)
      – Unlike a spreadsheet, however, multiple values can be associated with a cell
    ● One domain can contain different types of data (some attributes not filled in)
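The domain / item / attribute / value hierarchy above can be modeled in a few lines. The following is an in-memory sketch only, not the AWS client API: the `Domain` class is hypothetical, and its `put_attributes` and `get_attributes` methods merely echo the names of the PutAttributes and GetAttributes operations.

```python
from collections import defaultdict


class Domain:
    """Toy model of a SimpleDB-style domain: items hold multi-valued attributes."""

    def __init__(self, name):
        self.name = name
        # item name -> attribute name -> set of values
        self.items = defaultdict(lambda: defaultdict(set))

    def put_attributes(self, item_name, attributes):
        """Add (attribute, value) pairs; repeating an attribute adds another value."""
        for attribute, value in attributes:
            self.items[item_name][attribute].add(value)

    def get_attributes(self, item_name):
        """Return the item's attributes with values sorted for stable output."""
        if item_name not in self.items:
            return {}
        return {a: sorted(v) for a, v in self.items[item_name].items()}


products = Domain("products")
products.put_attributes("item1", [("color", "blue"), ("color", "red"), ("size", "M")])
products.put_attributes("item2", [("author", "Rusty")])  # different attributes, same domain

print(products.get_attributes("item1"))
print(products.get_attributes("item2"))
```

Note how "item1" holds two values in its "color" cell and how "item2" uses an attribute that "item1" lacks, which are the two ways the model departs from a spreadsheet.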
  • SimpleDB API Summary
    ● CreateDomain
    ● DeleteDomain
    ● ListDomains
    ● PutAttributes
    ● BatchPutAttributes
    ● DeleteAttributes
    ● BatchDeleteAttributes
    ● GetAttributes
    ● Select
    ● DomainMetadata
  • Who uses SimpleDB?
    ● Netflix
    ● Other Amazon EC2 customers...
  • memcached
    ● General-purpose distributed memory caching system
    ● Often used to cache in RAM data that might otherwise be obtained from an external data source
    ● LRU eviction (when the cache is full)
    ● Can be distributed across multiple machines
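The caching pattern above, keeping recently used data in RAM and evicting the least recently used entry when full, can be sketched with Python's `OrderedDict`. This is a single-process illustration of LRU eviction, not memcached itself or its client protocol; the `LRUCache` class and the `loader` callback (standing in for the external data source) are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()  # oldest entry first, newest last

    def get(self, key, loader=None):
        if key in self._data:
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]
        if loader is None:
            return None  # cache miss and no fallback source given
        value = loader(key)  # e.g. a slow database query
        self.set(key, value)
        return value

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry


cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.set("c", 3)  # cache is full, so "b" is evicted
print(cache.get("b"), cache.get("a"), cache.get("c"))
```

A distributed cache applies the same per-node policy and additionally decides which node owns each key, typically by hashing the key on the client side.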
  • Who uses memcached?
    ● YouTube
    ● Zynga
    ● Facebook
    ● Twitter
  • Terracotta
    ● JVM in-memory distributed cache / store
    ● The object store can be persistent
    ● Distribution between nodes is handled through the Terracotta server
    ● Supports multiple Terracotta servers
    ● Nodes only receive data they need/reference
  • Who uses Terracotta?
    ● Sakai (thanks to John Wiley & Sons)
    ● PartyGaming (PartyPoker.com)
    ● Adobe
    ● Pearson
  • Example Case Studies
  • Yahoo!
    ● Hadoop
      – http://developer.yahoo.com/blogs/hadoop
      – More than 100,000 CPUs in >36,000 computers running Hadoop
      – Our biggest cluster: 4000 nodes (2*4 CPU boxes with 4*1 TB disk & 16 GB RAM)
      – Used to support research for Ad Systems and Web Search
      – Also used to do scaling tests to support development of Hadoop on larger clusters
      – >60% of Hadoop jobs within Yahoo are Pig jobs
  • Twitter
    ● How Twitter Uses NoSQL
      – http://goo.gl/Bwxoe
    ● Scribe
      – Syslog stopped scaling
    ● Hadoop
      – Needs to store more data per day than it can reliably write to a single hard drive
    ● Pig
      – Used for interacting with Hadoop
    ● HBase
      – People Search
    ● FlockDB
      – Social Graph Analysis
  • Netflix
    ● NoSQL at Netflix
      – http://goo.gl/SDcsZ
    ● SimpleDB
      – Highly durable, with writes automatically replicated across availability zones within a region
      – Love it when others do heavy lifting for us
    ● Hadoop/HBase
      – Convenient, high-performance column-oriented distributed database solution
      – HBase makes it really easy to grow your cluster and re-distribute load across nodes at runtime
    ● Cassandra
      – Adding more servers, without the need to re-shard
  • Facebook
    ● http://goo.gl/J9EVW
    ● 350 million users sending over 15 billion person-to-person messages per month
    ● Chat service supports over 300 million users who send over 120 billion messages per month
    ● Two patterns emerged
      – A short set of temporal data that tends to be volatile
      – An ever-growing set of data that rarely gets accessed
    ● Evaluated clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems
      – MySQL proved to not handle the long tail of data well (as indexes and data grow large, performance suffers)
      – Cassandra's eventual consistency model proved to be a difficult pattern to reconcile for our new Messages infrastructure.
  • “There is a learning curve and an operational overhead. Still, the scalability, availability and performance advantages of the NoSQL persistence model are evident and are paying for themselves already, and will be central to our long-term cloud strategy.”
    Yury Izrailevsky, Netflix
  • Questions & Answers
    Cris J. Holdorph, Software Architect, Unicon, Inc.
    Twitter: @holdorph
    holdorph@unicon.net
    www.unicon.net