No SQL Technologies

What Should I Know
about NoSQL?
Cris J. Holdorph
Software Architect
Unicon, Inc.

Jasig Conference
Westminster, CO
May 24, 2011

© Copyright Unicon, Inc., 2008. Some rights reserved. This work is licensed under a
Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/

Agenda
1. Definitions
2. History
3. Projects
4. Example Case Studies

Definitions
●
RDBMS
●
SQL
●
CRUD
●
ACID
– Atomicity, Consistency, Isolation, Durability
●
BASE
– Basically Available, Soft state, Eventual
consistency

Definitions
●
Big Data
●
Sharding
●
Cloud Computing
●
Distributed File System
●
Key Value Store

Map Reduce
●
Patented software framework introduced by Google
in 2004 to support distributed computing on large
data sets on clusters of computers.
●
Name inspired by the map and reduce functions of
functional programming (though their purpose in
MapReduce is not the same as in their original forms)
●
Map
– The master node takes the input, partitions it up into
smaller sub-problems, and distributes those to worker nodes
●
Reduce
– The master node then takes the answers to all the sub-
problems and combines them in some way to get the output
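The Map and Reduce steps above can be sketched with a word count, the usual teaching example. This is a single-process illustration, not Google's or Hadoop's implementation: the function names (`partition`, `map_phase`, `reduce_phase`, `word_count`) are hypothetical, and the "worker nodes" are simulated by plain list chunks.

```python
from collections import defaultdict


def partition(lines, num_workers):
    """Master step: split the input into one chunk per simulated worker."""
    return [lines[i::num_workers] for i in range(num_workers)]


def map_phase(lines):
    """Worker step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs):
    """Combine step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines, num_workers=3):
    """Run partition, map, and reduce in sequence on one machine."""
    chunks = partition(lines, num_workers)
    emitted = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(emitted)
```

In a real cluster each chunk would be processed on a different machine, and the emitted pairs would be grouped by key before reduction.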

What does NoSQL Stand For?
●
NoSQL
●
No SQL
●
Not SQL
●
Not Only SQL
●
Not the RDBMS
●
Wikipedia:
– Carlo Strozzi used the term "NoSQL" in 1998 to
name his lightweight, open-source relational
database that did not expose an SQL interface.

History
●
Some techniques have existed for over 25
years
●
Teradata has been selling its product for more than 20
years
●
RDBMS dates back to 1970

CAP Theorem
●
A conjecture made by Eric Brewer at the
Symposium on Principles of Distributed
Computing (2000)
●
States that it is only possible to achieve 2 of 3
– Consistency (all nodes see the same data at the
same time)
– Availability (node failures do not prevent survivors
from continuing to operate)
– Partition Tolerance (the system continues to
operate despite arbitrary message loss)

CAP
●
Consistent and Available
– ACID systems, MySQL cluster, Oracle Coherence,
Drizzle
●
Consistent and Partition Tolerant
– SCLA (strongly consistent, loosely available)
– HBase, Bigtable
●
Available and Partition Tolerant
– BASE systems (CouchDB, SimpleDB, MongoDB)
●
Cassandra (sits between SCLA/BASE
systems)

Hadoop
●
Open-source software for reliable, scalable,
distributed computing (Hadoop website)
– Hadoop Common
– HDFS
– MapReduce
●
Initially created in early 2006 to support
search engine project Nutch
●
Inspired by the Google File System (Oct 2003) and
MapReduce (Dec 2004) papers

Hadoop Related Projects
●
HBase
– A scalable, distributed database that supports
structured data storage for large tables
●
Hive
– A data warehouse infrastructure that provides
data summarization and ad hoc querying
●
Pig
– A high-level data-flow language and execution
framework for parallel computation
●
Cassandra
– Uses Hadoop for MapReduce

Who Uses Hadoop
●
eBay (532 nodes, Search optimization)
●
Facebook (1100x8 node cluster, 300x8 node cluster, more on
this later)
●
GumGum (Ken Weiner, 20+ node cluster on Amazon EC2)
●
Hulu (log storage analysis)
●
Last.fm (44x2 nodes log analysis, 20x2 nodes profile analysis)
●
LinkedIn (120x2x4 nodes, 520x2x4 nodes, "People you may
know")
●
Twitter (more on this later)
●
Yahoo! (100,000 cpus running Hadoop, more on this later)

CouchDB
●
Apache open source document-oriented database
written in Erlang (a concurrent programming language)
●
Designed to scale horizontally
●
Stores documents (one or more field value pairs
expressed as JSON)
●
ACID Semantics
●
Map/Reduce Views and Indexes (written in
server-side JavaScript)
●
Bidirectional replication (with conflict resolution)
●
REST API

http://couchdb.apache.org/img/sketch.png

CouchDB Sample Document

"Subject": "I like Plankton"
"Author": "Rusty"
"PostedDate": "5/23/2006"
"Tags": ["plankton", "baseball", "decisions"]
"Body": "I decided today that I don't like baseball. I
like plankton."

http://couchdb.apache.org/docs/intro.html
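The sample document can be written as a Python dictionary and serialized to JSON, which is the form CouchDB stores. The sketch below also emulates, in memory, what a map-style view over the "Tags" field might return. Real CouchDB views are JavaScript functions run by the server; `to_json` and `tags_view` are hypothetical helper names used only for illustration.

```python
import json

# The sample document from the slide, as a Python dictionary.
doc = {
    "Subject": "I like Plankton",
    "Author": "Rusty",
    "PostedDate": "5/23/2006",
    "Tags": ["plankton", "baseball", "decisions"],
    "Body": "I decided today that I don't like baseball. I like plankton.",
}


def to_json(document):
    """Serialize a document to the JSON text a document store would hold."""
    return json.dumps(document, indent=2)


def tags_view(documents):
    """Emulate a map-style view: one (tag, subject) row per tag, sorted by key."""
    rows = [(tag, d["Subject"]) for d in documents for tag in d.get("Tags", [])]
    return sorted(rows)
```

Calling `tags_view([doc])` yields one row per tag, so a single document can appear under several keys in the same view.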

Who uses CouchDB?
●
Ubuntu One – cloud storage service
– http://ubuntuone.com/
●
"I Play WoW" Facebook app
– http://blog.socklabs.com/2008/12/24/iplaywow_monthly_actives.html

●
Wego - travel site
– http://www.wego.com/

Cassandra
●
Fault Tolerant (replication, failed nodes can
be replaced with no downtime)
●
Decentralized (every node in cluster is
identical, no bottlenecks)
●
Supports either Synchronous or
Asynchronous update replication
●
Supports more than simple key/value pairs
●
Elastic (read/write throughput increases
linearly as machines are added)
●
Durable (suitable for applications that can't
afford to lose data)

Cassandra
●
Initially developed by Facebook for Inbox
Search (until replaced by HBase)
●
Key-value store where each value can itself hold
multiple values
●
Some inspiration from Amazon's Dynamo
(another key-value store)
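A key whose value is itself a set of named values can be pictured as a nested map. The sketch below is a toy in-memory model assuming that shape (row key, then column name, then value); it is not Cassandra's API, and `ColumnStore`, `insert`, and `get` are hypothetical names.

```python
from collections import defaultdict


class ColumnStore:
    """Toy in-memory model: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, row_key, column, value):
        """Set one named value under a row key, creating the row if needed."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Return one column's value, or a copy of the whole row if no column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)
```

Different rows may carry different columns, which is the main difference from a fixed relational schema.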

Who uses Cassandra?
●
Facebook (previously)
●
Twitter
●
Digg
●
Cisco

MongoDB
●
Name is derived from "humongous"
●
Document-oriented database written in C++
●
Manages collections of JSON-like documents
●
Binaries available for Windows, Linux, OS X,
Solaris
●
Supports dates, regular expressions, code,
binary data (all BSON types)
●
Cursors for query results
●
Any field can be queried at any time
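Querying any field and reading results through a cursor can be illustrated without a database. The sketch below assumes documents are plain dictionaries and uses a Python generator as a stand-in for a cursor; `find` here is a hypothetical local function, not a MongoDB driver call.

```python
def find(collection, query):
    """Lazily yield documents whose fields equal every key/value pair in `query`."""
    for doc in collection:
        if all(field in doc and doc[field] == expected
               for field, expected in query.items()):
            yield doc


# A small collection of schema-free documents; fields vary per document.
posts = [
    {"author": "Rusty", "title": "I like Plankton", "likes": 3},
    {"author": "Dana", "title": "Sharding notes", "likes": 3},
    {"author": "Rusty", "title": "Baseball", "tags": ["sports"]},
]
```

Because `find` is a generator, results are produced one at a time as the caller iterates, much as a cursor avoids loading a whole result set at once.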

MongoDB
●
Queries can include user-defined JavaScript
functions
●
Master/Slave (only master supports writes,
slaves can be read from)
●
Scales horizontally using sharding
●
Support for Map/Reduce
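Sharding spreads documents across machines by routing each one according to a key. The sketch below shows generic hash-based routing as one possible scheme; it is not a description of MongoDB's own sharding mechanism, and `shard_for` and `route` are hypothetical names.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a string key to a shard index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def route(keys, num_shards):
    """Group keys by the shard that would store them."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards
```

A stable hash (rather than Python's built-in `hash`, which varies between runs for strings) keeps a key on the same shard across processes. Changing `num_shards` moves most keys, which is why production systems use more elaborate schemes.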

Who uses MongoDB?
●
New York Times
●
Shutterfly
●
Foursquare
●
SourceForge
●
Intuit

Google Bigtable
●
Built on GFS (Google File System)
●
Can be used with Google App Engine
●
Maps two arbitrary strings (row key and column key)
and a timestamp to a value
●
Designed to scale into the petabyte range
●
Designed to scale across hundreds or
thousands of machines
●
Portions of a table (tablets) can be
compressed
●
HBase was modeled after BigTable
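The "two strings and a timestamp" mapping can be sketched as a dictionary keyed by a tuple. This assumes the two strings are a row key and a column key and that the mapped result is an uninterpreted value; `SparseTable` is a hypothetical in-memory toy, not an interface offered by Google or HBase.

```python
class SparseTable:
    """Toy model of a map from (row key, column key, timestamp) to a value."""

    def __init__(self):
        self._cells = {}  # (row_key, column_key, timestamp) -> value

    def put(self, row, column, timestamp, value):
        """Store one version of a cell."""
        self._cells[(row, column, timestamp)] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = [
            (ts, value)
            for (r, c, ts), value in self._cells.items()
            if r == row and c == column and (timestamp is None or ts <= timestamp)
        ]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]
```

Keeping the timestamp in the key means older versions of a cell stay readable until they are explicitly cleaned up.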

Who uses Bigtable?
●
Google Reader
●
Google Maps
●
Google Book Search
●
Google Earth
●
Blogger.com
●
Google Code
●
Orkut
●
YouTube
●
Gmail

Amazon SimpleDB
●
Written in Erlang
●
Used with Amazon EC2 and Amazon S3
●
Easy access to lookup and query functions
●
No support for less-used, complex database
functions
●
No need to predefine the format of stored data
●
Scalable (with size limitations)
– 10 GB per domain, up to 250 domains
●
Fast/Reliable
●
Supports eventually consistent read and consistent read
●
Potentially Inexpensive

SimpleDB Data Model

http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/DataModel.html

SimpleDB Data Model
●
Customer Account (Amazon Web Services account)
●
Domains (similar to tables, or spreadsheet tabs)
●
Items (similar to rows)
●
Attributes (similar to columns)
●
Values (similar to cells)
– Unlike a spreadsheet, however, multiple values can be
associated with a cell
●
One domain can contain different types of data
(some attributes not filled in)
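The domain / item / attribute / value hierarchy can be modeled as nested maps ending in a set, since one attribute may hold several values. The sketch below is an in-memory illustration only: the `Domain` class and its `put_attributes` / `get_attributes` methods are hypothetical stand-ins that borrow names from the SimpleDB operations, not calls into any AWS SDK.

```python
from collections import defaultdict


class Domain:
    """Toy model of one domain: item name -> attribute name -> set of values."""

    def __init__(self, name):
        self.name = name
        self._items = defaultdict(lambda: defaultdict(set))

    def put_attributes(self, item_name, attributes):
        """Add (attribute, value) pairs to an item; repeated attributes accumulate."""
        for attribute, value in attributes:
            self._items[item_name][attribute].add(value)

    def get_attributes(self, item_name):
        """Return a copy of the item's attributes, or {} if the item is unknown."""
        if item_name not in self._items:
            return {}
        return {attr: set(values) for attr, values in self._items[item_name].items()}
```

Items in the same domain need not share attributes, so a clothing item and a book can sit side by side with entirely different fields.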

SimpleDB API Summary
●
CreateDomain
●
DeleteDomain
●
ListDomains
●
PutAttributes
●
BatchPutAttributes
●
DeleteAttributes
●
BatchDeleteAttributes
●
GetAttributes
●
Select
●
DomainMetadata

Who uses SimpleDB?
●
Netflix
●
Other Amazon EC2 customers...

memcached
●
General purpose distributed memory caching system
●
Often used to cache data in RAM that would otherwise be
obtained from an external data source
●
LRU eviction (when cache is full)
●
Can be distributed across multiple machines
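Least-recently-used eviction can be sketched with an ordered dictionary. This is a minimal single-process illustration of the policy, not memcached's actual memory management; `LRUCache` and its `get` / `set` methods are hypothetical names.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key, default=None):
        """Return the cached value and mark the key as recently used."""
        if key not in self._data:
            return default
        self._data.move_to_end(key)
        return self._data[key]

    def set(self, key, value):
        """Store a value, evicting the oldest entry if capacity is exceeded."""
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)
```

A typical caller checks the cache first and falls back to the external data source only on a miss, then stores the fetched value for next time.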

Who uses memcached?
●
YouTube
●
Zynga
●
Facebook
●
Twitter

Terracotta
●
JVM in-memory distributed cache / store
●
The object store can be persistent
●
Distribution between nodes is handled through
Terracotta server
●
Supports multiple Terracotta servers
●
Nodes only receive data they need/reference

Who uses Terracotta?
●
Sakai (thanks to John Wiley & Sons)
●
PartyGaming (PartyPoker.com)
●
Adobe
●
Pearson

Example Case Studies

Yahoo!
●
Hadoop
– http://developer.yahoo.com/blogs/hadoop
– More than 100,000 CPUs in >36,000 computers
running Hadoop
– Our biggest cluster: 4000 nodes (2*4cpu boxes w
4*1TB disk & 16GB RAM)
– Used to support research for Ad Systems and Web
Search
– Also used to do scaling tests to support
development of Hadoop on larger clusters
– >60% of Hadoop Jobs within Yahoo are Pig jobs

Twitter
●
How Twitter Uses NoSQL
– http://goo.gl/Bwxoe
●
Scribe
– Syslog stopped scaling
●
Hadoop
– Needs to store more data per day than it can reliably write to a
single hard drive
●
Pig
– Used for interacting with Hadoop
●
HBase
– People Search
●
FlockDB
– Social Graph Analysis

Netflix
●
NoSQL at Netflix
– http://goo.gl/SDcsZ
●
SimpleDB
– Highly durable, with writes automatically replicated across
availability zones within a region
– Love it when others do heavy lifting for us
●
Hadoop/HBase
– Convenient, high-performance column-oriented distributed
database solution
– HBase makes it really easy to grow your cluster and re-distribute
load across nodes at runtime
●
Cassandra
– Adding more servers, without the need to re-shard

Facebook
●
http://goo.gl/J9EVW
●
350 million users sending over 15 billion person-to-person messages
per month
●
Chat service supports over 300 million users who send over 120 billion
messages per month
●
Two patterns emerged
– A short set of temporal data that tends to be volatile
– An ever-growing set of data that rarely gets accessed
●
Evaluated clusters of MySQL, Apache Cassandra, Apache HBase, and a
couple of other systems
– MySQL proved to not handle the long tail of data well (as
indexes/data grow large, performance suffers)
– Found Cassandra's eventual consistency model to be a difficult pattern to
reconcile for our new Messages infrastructure.

“There is a learning curve and an
operational overhead. Still, the scalability,
availability and performance advantages of
the NoSQL persistence model are evident
and are paying for themselves already, and
will be central to our long-term cloud
strategy.”
Yury Izrailevsky, Netflix

Questions & Answers

Cris J. Holdorph
Software Architect
Unicon, Inc.

Twitter: @holdorph

holdorph@unicon.net
www.unicon.net
