No SQL Technologies

Introduction to NoSQL Technologies


No SQL Technologies Presentation Transcript

  • What Should I Know about NoSQL? Cris J. Holdorph Software Architect Unicon, Inc. Jasig Conference Westminster, CO May 24, 2011 © Copyright Unicon, Inc., 2008. Some rights reserved. This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/
  • Lethal SQL 2
  • 3
  • Agenda 1. Definitions 2. History 3. Projects 4. Example Case Studies 4
  • Definitions 5
  • Definitions● RDBMS● SQL● CRUD● ACID – Atomicity, Consistency, Isolation, Durability● BASE – Basically Available, Soft state, Eventual consistency 6
  • 7
  • Definitions● Big Data● Sharding● Cloud Computing● Distributed File System● Key Value Store 8
  • History 9
  • Map Reduce● Patented software framework introduced by Google in 2004 to support distributed computing on large data sets on clusters of computers.● Naming originally inspired by map and reduce functions of functional programming (but their purpose is not the same as it was there)● Map – The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes● Reduce – The master node then takes the answers to all the sub-problems and combines them in some way to get the output 10
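
The map and reduce steps described on this slide can be sketched in a few lines of Python. This is a minimal single-process illustration, not Google's or Hadoop's implementation: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are assumptions chosen for the example, and the "worker nodes" are simulated by a plain loop.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework runs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # In a real cluster each chunk would be sent to a different worker node;
    # here the "workers" simply run one after another in a single process.
    emitted = []
    for chunk in chunks:
        emitted.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


counts = word_count(["no sql", "not only sql", "not sql"])
```

Each input string stands in for one partition of the data set. Because `map_phase` looks at only its own chunk, the map work can be spread across machines, which is the property the slide is pointing at.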
  • What does NoSQL Stand For?● NoSQL● No SQL● Not SQL● Not Only SQL● Not the RDBMS● Wikipedia: – Carlo Strozzi used the term "NoSQL" in 1998 to name his lightweight, open-source relational database that did not expose an SQL interface. 11
  • History● Some techniques have existed for over 25 years● Teradata has been selling its product for more than 20 years● RDBMS dates back to 1970 12
  • CAP Theorem● A conjecture made by Eric Brewer at the Symposium on Principles of Distributed Computing (2000)● States that it is only possible to achieve 2 of 3 – Consistency (all nodes see the same data at the same time) – Availability (node failures do not prevent survivors from continuing to operate) – Partition Tolerance (the system continues to operate despite arbitrary message loss) 13
  • CAP● Consistent and Available – ACID systems, MySQL cluster, Oracle Coherence, Drizzle● Consistent and Partition Tolerant – SCLA (strongly consistent, loosely available) – HBase, Bigtable● Available and Partition Tolerant – BASE systems (CouchDB, SimpleDB, MongoDB)● Cassandra (sits between SCLA/BASE systems) 14
  • Projects 15
  • Hadoop● Open-source software for reliable, scalable, distributed computing (Hadoop website) – Hadoop Common – HDFS – MapReduce● Initially created in early 2006 to support the search engine project Nutch● Inspired by the Google File System (Oct 2003) and MapReduce (2004) papers 16
  • Hadoop Related Projects● HBase – A scalable, distributed database that supports structured data storage for large tables● Hive – A data warehouse infrastructure that provides data summarization and ad hoc querying● Pig – A high-level data-flow language and execution framework for parallel computation● Cassandra – uses Hadoop for MapReduce 17
  • Who Uses Hadoop● eBay (532 nodes, Search optimization)● Facebook (1100x8 node cluster, 300x8 node cluster, more on this later)● GumGum (Ken Weiner, 20+ node cluster on Amazon EC2)● Hulu (log storage analysis)● Last.fm (44x2 nodes log analysis, 20x2 nodes profile analysis)● LinkedIn (120x2x4 nodes, 520x2x4 nodes, "People you may know")● Twitter (more on this later)● Yahoo! (100,000 CPUs running Hadoop, more on this later) 18
  • CouchDB● Apache open source document-oriented database written in Erlang (a concurrent programming language)● Designed to scale horizontally● Stores documents (one or more field/value pairs expressed as JSON)● ACID Semantics● Map/Reduce Views and Indexes (written in server-side JavaScript)● Bi-directional replication (with conflict resolution)● REST API 19
  • http://couchdb.apache.org/img/sketch.png 20
  • CouchDB Sample Document: "Subject": "I like Plankton", "Author": "Rusty", "PostedDate": "5/23/2006", "Tags": ["plankton", "baseball", "decisions"], "Body": "I decided today that I don't like baseball. I like plankton." http://couchdb.apache.org/docs/intro.html 21
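
As a rough illustration of what "documents expressed as JSON" means in practice, the sample document can be written as a Python dictionary and round-tripped through the standard `json` module. The `map_by_tag` function is a hypothetical stand-in for a view's map step: real CouchDB views are written in server-side JavaScript, so this only mirrors the idea, not the actual API.

```python
import json

# The sample document from the slide, as a Python dictionary.
post = {
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "5/23/2006",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton.",
}


def map_by_tag(doc):
    """Yield one (tag, subject) pair per tag (hypothetical view-style map step)."""
    for tag in doc.get("Tags", []):
        yield tag, doc["Subject"]


encoded = json.dumps(post)    # what would travel over a REST API as the body
decoded = json.loads(encoded)  # what a client gets back after parsing
tag_index = list(map_by_tag(decoded))
```

There is no schema to declare up front: a second document could carry a different set of fields, and `map_by_tag` would simply emit nothing for a document without `Tags`.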
  • Who uses CouchDB?● Ubuntu One – cloud storage service – http://ubuntuone.com/● "I Play WoW" Facebook app – http://blog.socklabs.com/2008/12/24/iplaywow_monthly_actives.html● Wego – travel site – http://www.wego.com/ 22
  • Cassandra● Fault Tolerant (replication, failed nodes can be replaced with no downtime)● Decentralized (every node in the cluster is identical, no bottlenecks)● Supports either Synchronous or Asynchronous update replication● Supports more than simple key/value pairs● Elastic (read/write throughput increases linearly as machines are added)● Durable (suitable for applications that can't afford to lose data) 23
  • Cassandra● Initially developed by Facebook for Inbox Search (until replaced by HBase)● Key-value store where a value can itself hold multiple values● Some inspiration from Amazon's Dynamo (another key-value store) 24
  • Who uses Cassandra?● Facebook (previously)● Twitter● Digg● Cisco 25
  • MongoDB● Name is derived from "humongous"● Document-oriented database written in C++● Manages collections of JSON-like documents● Binaries available for Windows, Linux, OS X, Solaris● Supports dates, regular expressions, code, binary data (all BSON types)● Cursors for query results● Any field can be queried at any time 26
  • MongoDB● Queries can include user-defined JavaScript functions● Master/Slave (only master supports writes, slaves can be read from)● Scales horizontally using sharding● Support for Map/Reduce 27
  • Who uses MongoDB?● New York Times● Shutterfly● Foursquare● SourceForge● Intuit 28
  • Google Big Table● Built on GFS (Google File System)● Can be used with Google App Engine● Maps two arbitrary strings (a row key and a column key) and a timestamp to a value● Designed to scale into the petabyte range● Designed to scale across hundreds or thousands of machines● Portions of a table (tablets) can be compressed● HBase was modeled after BigTable 29
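
The "two strings and a timestamp" mapping can be pictured as a dictionary keyed by row and column, with several timestamped versions per cell. The sketch below is a toy in-memory model under that assumption; `TinyTable`, `put`, and `get` are hypothetical names, and nothing here reflects how Bigtable or HBase store or distribute data.

```python
class TinyTable:
    """Toy model of a (row, column, timestamp) -> value map."""

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = {}

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest version at or before `timestamp` (newest overall if None)."""
        versions = self._cells.get((row, column), {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        if not candidates:
            return None
        return versions[max(candidates)]


table = TinyTable()
table.put("com.example.www", "contents:", 1, "<html>v1</html>")
table.put("com.example.www", "contents:", 3, "<html>v3</html>")

latest = table.get("com.example.www", "contents:")
as_of_2 = table.get("com.example.www", "contents:", timestamp=2)
```

Keeping the timestamp in the key is what lets a reader ask for the state of a cell "as of" an earlier time instead of only the latest value.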
  • Who uses Big Table?● Google Reader● Google Maps● Google Book Search● Google Earth● Blogger.com● Google Code● Orkut● YouTube● Gmail 30
  • Amazon SimpleDB● Written in Erlang● Used with Amazon EC2 and Amazon S3● Easy access to lookup and query functions● Without support for the less used complex database functions● Do not need to pre-define data formats that will be stored● Scalable (with size limitations) – 10 GB per domain, up to 250 domains● Fast/Reliable● Supports eventually consistent reads and consistent reads● Potentially Inexpensive 31
  • SimpleDB Data Model http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/DataModel.html 32
  • SimpleDB Data Model● Customer Account (amazon web services account)● Domains (similar to tables, or spreadsheet tabs)● Items (similar to rows)● Attributes (similar to columns)● Values (similar to cells) – Unlike a spreadsheet, however, multiple values can be associated with a cell● One domain can contain different types of data (some attributes not filled in) 33
  • SimpleDB API Summary● CreateDomain● DeleteDomain● ListDomains● PutAttributes● BatchPutAttributes● DeleteAttributes● BatchDeleteAttributes● GetAttributes● Select● DomainMetadata 34
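
To make the domain / item / attribute / value model more concrete, here is a small in-memory sketch. It is not the AWS SDK or the SimpleDB wire API: `ToyDomain` and its methods are hypothetical, with method names borrowed from the operation names on the slide purely for readability, and `select` is reduced to a single equality test.

```python
from collections import defaultdict


class ToyDomain:
    """Toy stand-in for one domain: items hold attributes, attributes hold value sets."""

    def __init__(self):
        # item name -> attribute name -> set of values
        self._items = defaultdict(lambda: defaultdict(set))

    def put_attributes(self, item_name, attributes):
        """Add (name, value) pairs; repeating a name adds another value to that cell."""
        for name, value in attributes:
            self._items[item_name][name].add(value)

    def get_attributes(self, item_name):
        """Return the item's attributes as {name: sorted list of values}."""
        if item_name not in self._items:
            return {}
        return {n: sorted(v) for n, v in self._items[item_name].items()}

    def select(self, name, value):
        """Return the names of items whose attribute `name` contains `value`."""
        return sorted(
            item for item, attrs in self._items.items()
            if value in attrs.get(name, ())
        )


domain = ToyDomain()
domain.put_attributes("item1", [("color", "red"), ("color", "blue"), ("size", "M")])
domain.put_attributes("item2", [("color", "red")])

item1 = domain.get_attributes("item1")
red_items = domain.select("color", "red")
```

Note that `item1` has two values in its `color` cell and `item2` has no `size` at all, which matches the two points the data model slide makes: cells are multi-valued, and items in one domain need not share the same attributes.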
  • Who uses SimpleDB?● Netflix● Other Amazon EC2 customers... 35
  • memcached● General purpose distributed memory caching system● Often used to cache data in RAM that would otherwise have to be obtained from an external data source● LRU eviction (when cache is full)● Can be distributed across multiple machines 36
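
The LRU behaviour and the "check the cache before the external source" pattern can be sketched as follows. This is a single-process illustration built on Python's `OrderedDict`, not memcached itself or a memcached client; `LRUCache`, `cached_fetch`, and `slow_lookup` are hypothetical names introduced for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key when full."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)  # mark as most recently used
        return self._data[key]

    def set(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # drop the least recently used entry


def cached_fetch(cache, key, fetch):
    """Return the cached value, calling `fetch` only on a miss.

    For simplicity a cached value of None is treated as a miss.
    """
    value = cache.get(key)
    if value is None:
        value = fetch(key)
        cache.set(key, value)
    return value


calls = []


def slow_lookup(key):
    """Stand-in for an external data source such as a database query."""
    calls.append(key)
    return key.upper()


cache = LRUCache(capacity=2)
first = cached_fetch(cache, "a", slow_lookup)   # miss: goes to the source
second = cached_fetch(cache, "a", slow_lookup)  # hit: served from the cache
cached_fetch(cache, "b", slow_lookup)
cache.get("a")                                  # touch "a" so "b" becomes oldest
cached_fetch(cache, "c", slow_lookup)           # cache is full, so "b" is evicted
```

Distribution across machines, the last bullet on the slide, is left out; a client would typically hash each key to pick which cache server to talk to.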
  • Who uses memcached?● YouTube● Zynga● Facebook● Twitter 37
  • Terracotta● JVM in-memory distributed cache / store● The object store can be persistent● Distribution between nodes is handled through Terracotta server● Supports multiple Terracotta servers● Nodes only receive data they need/reference 38
  • Who uses Terracotta?● Sakai (thanks to John Wiley & Sons)● PartyGaming (PartyPoker.com)● Adobe● Pearson 39
  • Example Case Studies 40
  • Yahoo!● Hadoop – http://developer.yahoo.com/blogs/hadoop – More than 100,000 CPUs in >36,000 computers running Hadoop – Our biggest cluster: 4000 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM) – Used to support research for Ad Systems and Web Search – Also used to do scaling tests to support development of Hadoop on larger clusters – >60% of Hadoop Jobs within Yahoo are Pig jobs 41
  • Twitter● How Twitter Uses NoSQL – http://goo.gl/Bwxoe● Scribe – Syslog stopped scaling● Hadoop – Needs to store more data per day than it can reliably write to a single hard drive● Pig – Used for interacting with Hadoop● HBase – People Search● FlockDB – Social Graph Analysis 42
  • Netflix ● NoSQL at Netflix – http://goo.gl/SDcsZ ● SimpleDB – Highly durable, with writes automatically replicated across availability zones within a region – Love it when others do heavy lifting for us● Hadoop/HBase – Convenient, high-performance column-oriented distributed database solution – HBase makes it really easy to grow your cluster and re-distribute load across nodes at runtime● Cassandra – Adding more servers, without the need to re-shard 43
  • Facebook● http://goo.gl/J9EVW● 350 million users sending over 15 billion person-to-person messages per month● Chat service supports over 300 million users who send over 120 billion messages per month● Two patterns emerged – A short set of temporal data that tends to be volatile – An ever-growing set of data that rarely gets accessed● Evaluated clusters of MySQL, Apache Cassandra, Apache HBase, and a couple of other systems – MySQL proved not to handle the long tail of data well (as indexes/data grow large, performance suffers) – Cassandra's eventual consistency model proved to be a difficult pattern to reconcile for our new Messages infrastructure. 44
  • “There is a learning curve and an operational overhead. Still, the scalability, availability and performance advantages of the NoSQL persistence model are evident and are paying for themselves already, and will be central to our long-term cloud strategy.” Yury Izrailevsky, Netflix 45
  • Questions & Answers Cris J. Holdorph Software Architect Unicon, Inc. Twitter: @holdorph holdorph@unicon.net www.unicon.net 46