This document provides a summary of a presentation on Big Data and NoSQL databases. It introduces the presenters, Melissa Demsak and Don Demsak, and their backgrounds. It then discusses how data storage needs have changed with the rise of Big Data, including the problems created by large volumes of data. The presentation contrasts traditional relational database implementations with NoSQL data stores, identifying four categories of NoSQL data models: document, key-value, graph, and column family. It provides examples of databases that fall under each category. The presentation concludes with a comparison of real-world scenarios and which data storage solutions might be best suited to each scenario.
Making Apache Spark Better with Delta Lake – Databricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
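For the second bullet above, a minimal PySpark sketch of what moving an application onto Delta Lake can look like, assuming the delta-spark package is installed and using placeholder paths:

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = (SparkSession.builder
         .appName("delta-example")
         .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

# Write a DataFrame in Delta format instead of plain Parquet.
df = spark.range(0, 1000).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").save("/tmp/events_delta")

# Convert an existing Parquet directory to Delta in place (this adds a _delta_log).
DeltaTable.convertToDelta(spark, "parquet.`/tmp/events_parquet`")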
Open Source Reliability for Data Lake with Apache Spark by Michael Armbrust – Data Con LA
Abstract: Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* All technical aspects of Delta features
* What’s coming
* How to get started using it
* How to contribute
Bio: Michael Armbrust is a committer and PMC member of Apache Spark and the original creator of Spark SQL. He currently leads the team at Databricks that designed and built Structured Streaming and Databricks Delta. He received his PhD from UC Berkeley in 2013, where he was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage, and query optimization.
Lambda architecture is a popular technique where records are processed by a batch system and streaming system in parallel. The results are then combined during query time to provide a complete answer. Strict latency requirements to process old and recently generated events made this architecture popular. The key downside to this architecture is the development and operational overhead of managing two different systems.
There have been attempts to unify batch and streaming into a single system in the past, but organizations have not been very successful in those attempts. With the advent of Delta Lake, however, we are seeing many engineers adopt a simple continuous data flow model to process data as it arrives. We call this architecture the Delta Architecture.
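As a rough illustration of that continuous data flow model, here is a hedged PySpark sketch in which a single Delta table is written by a stream and read by both batch and streaming consumers; the rate source and paths are stand-ins:

from pyspark.sql import SparkSession

# Assumes the Delta Lake session configuration shown in the earlier sketch.
spark = SparkSession.builder.appName("delta-architecture").getOrCreate()

# Ingest: a stream (the rate source as a stand-in) appends to one Delta table.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()
(events.writeStream
       .format("delta")
       .option("checkpointLocation", "/tmp/bronze/_checkpoint")
       .start("/tmp/bronze"))

# A batch job can query the same table...
spark.read.format("delta").load("/tmp/bronze").count()

# ...and another stream can consume it incrementally, with no separate batch copy.
bronze_stream = spark.readStream.format("delta").load("/tmp/bronze")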
In this Knolx session, we will get to know Delta Lake and its features. Delta Lake is one of the greatest innovations by Databricks, making existing data lakes more scalable and reliable. Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard – Paris Data Engineers!
Delta Lake is an open source framework that lives on top of Parquet in your data lake to provide reliability and performance. It was open-sourced by Databricks this year and is gaining traction to become the de facto data lake format.
We’ll see all the good Delta Lake can do for your data with ACID transactions, DDL operations, schema enforcement, batch and stream support, and more!
Delta from a Data Engineer's Perspective – Databricks
Take a walk through the daily struggles of a data engineer in this presentation as we cover what is truly needed to create robust end to end Big Data solutions.
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake – Databricks
Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the data change log (binlog) of a relational database (OLTP) and replays these change logs in a timely manner to an external store for real-time OLAP, such as Delta or Kudu. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle OLTP source schema changes, and whether it is easy to build for a variety of databases with less code.
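The talk itself uses Spark Streaming SQL; as a rough sketch of the same idea in PySpark, each micro-batch of change records can be merged into a Delta table. The Kafka topic, columns, and parsing step below are hypothetical:

from delta.tables import DeltaTable
from pyspark.sql import DataFrame

# `spark` is a SparkSession configured for Delta Lake, as in the earlier sketches.
def upsert_to_delta(changes: DataFrame, batch_id: int) -> None:
    # Apply inserts, updates, and deletes from the binlog to the target table.
    target = DeltaTable.forPath(spark, "/tmp/customers_delta")
    (target.alias("t")
           .merge(changes.alias("s"), "t.id = s.id")
           .whenMatchedDelete(condition="s.op = 'DELETE'")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

cdc_stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "mysql.binlog.customers")
              .load())
# parsed = parse_binlog(cdc_stream)   # hypothetical step: binlog payload -> columns (id, op, ...)
# parsed.writeStream.foreachBatch(upsert_to_delta).start()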
Optimizing Delta/Parquet Data Lakes for Apache Spark – Databricks
This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta:
* Optimal file sizes in a data lake
* File compaction to fix the small file problem
* Why Spark hates globbing S3 files
* Partitioning data lakes with partitionBy
* Parquet predicate pushdown filtering
* Limitations of Parquet data lakes (files aren't mutable!)
* Mutating Delta lakes
* Data skipping with Delta ZORDER indexes
Speaker: Matthew Powers
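Two of the patterns listed above, partitioning on write and compacting small files, can be sketched roughly as follows; paths and file counts are placeholders, and on Databricks, Delta's OPTIMIZE command performs compaction natively:

# `spark` is a SparkSession configured for Delta Lake, as in the earlier sketches.
df = spark.read.format("delta").load("/tmp/events_delta")

# Partition by a low-cardinality column so queries can prune whole directories.
(df.write.format("delta")
   .mode("overwrite")
   .partitionBy("event_date")
   .save("/tmp/events_by_date"))

# Naive compaction for a plain Parquet lake: rewrite a partition as fewer, larger files.
small_files = spark.read.parquet("/tmp/raw/event_date=2012-05-15")
(small_files.repartition(8)
            .write.mode("overwrite")
            .parquet("/tmp/compacted/event_date=2012-05-15"))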
A Thorough Comparison of Delta Lake, Iceberg and Hudi – Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has emerged. Along with the Hive Metastore, these table formats try to solve problems that have stood in traditional data lakes for a long time, with declared features like ACID transactions, schema evolution, upserts, time travel, incremental consumption, and so on.
Designing and Building Next Generation Data Pipelines at Scale with Structure... – Databricks
Lambda architectures, data warehouses, data lakes, on-premise Hadoop deployments, elastic Cloud architecture… We’ve had to deal with most of these at one point or another in our lives when working with data. At Databricks, we have built data pipelines, which leverage these architectures. We work with hundreds of customers who also build similar pipelines. We observed some common pain points along the way: the HiveMetaStore can easily become a bottleneck, S3’s eventual consistency is annoying, file listing anywhere becomes a bottleneck once tables exceed a certain scale, there’s not an easy way to guarantee atomicity – garbage data can make it into the system along the way. The list goes on and on.
Fueled with the knowledge of all these pain points, we set out to make Structured Streaming the engine to ETL and analyze data. In this talk, we will discuss how we built robust, scalable, and performant multi-cloud data pipelines leveraging Structured Streaming, Databricks Delta, and other specialized features available in Databricks Runtime such as file notification based streaming sources and optimizations around Databricks Delta leveraging data skipping and Z-Order clustering.
You will walk away with the essence of what to consider when designing scalable data pipelines with the recent innovations in Structured Streaming and Databricks Runtime.
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa... – StreamNative
Apache Hudi is an open data lake platform designed around the streaming data model. At its core, Hudi provides transactions, upserts, and deletes on data lake storage, while also enabling CDC capabilities. Hudi also provides a coherent set of table services that can clean, compact, cluster, and optimize storage layout for better query performance. Finally, Hudi's data services provide out-of-the-box support for streaming data from event systems into lake storage in near real-time.
In this talk, we will walk through an end-to-end use case for change data capture from a relational database, starting with capturing changes using the Pulsar CDC connector and then demonstrating how you can use the Hudi DeltaStreamer tool to apply these changes to a table on the data lake. We will discuss various tips for operationalizing and monitoring such pipelines. We will conclude with some guidance on future integrations between the two projects, including a native Hudi/Pulsar connector and Hudi tiered storage.
Change Data Feed is a new feature of Delta Lake on Databricks that has been available as a public preview since DBR 8.2. This feature enables a new class of ETL workloads, such as incremental table/view maintenance and change auditing, that were not possible before. In short, users can now query row-level changes across different versions of a Delta table.
In this talk we will dive into how Change Data Feed works under the hood and how to use it with existing ETL jobs to make them more efficient and also go over some new workloads it can enable.
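A minimal sketch of what querying the Change Data Feed can look like, assuming a Delta table created with the delta.enableChangeDataFeed property and hypothetical version numbers:

# `spark` is a SparkSession configured for Delta Lake, as in the earlier sketches.
spark.sql("""
  CREATE TABLE IF NOT EXISTS customers (id INT, name STRING)
  USING delta
  TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

changes = (spark.read.format("delta")
           .option("readChangeFeed", "true")
           .option("startingVersion", 1)
           .option("endingVersion", 5)
           .table("customers"))

# Each returned row carries _change_type, _commit_version and _commit_timestamp columns.
changes.select("id", "name", "_change_type", "_commit_version").show()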
Deep Dive into the New Features of Apache Spark 3.1 – Databricks
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.1 extends its scope with more than 1,500 resolved JIRAs. We will talk about the exciting new developments in Apache Spark 3.1 as well as some other major initiatives that are coming in the future. In this talk, we want to share with the community many of the more important changes, with examples and demos.
The following features are covered: SQL features for ANSI SQL compliance, new streaming features, Python usability improvements, performance enhancements, and new tuning tricks in the query compiler.
This is the story of a great software war. Migrating Big Data legacy systems always involves great pain and sleepless nights. Migrating Big Data systems with multiple pipelines and machine learning models only adds to the existing complexity. What about migrating legacy systems that protect the Microsoft Azure Cloud backbone from network cyber attacks? That adds pressure and immense responsibility. In this session, we will share our migration story: migrating a machine learning-based product with thousands of paying customers that processes petabytes of network events a day. We will talk about our migration strategy, how we broke the system down into migratable parts, tested every piece of every pipeline, validated results, and overcame challenges. Lastly, we share why we picked Azure Databricks as our new modern environment for both data engineering and data science workloads.
Optimising Geospatial Queries with Dynamic File Pruning – Databricks
One of the most significant benefits provided by Databricks Delta is the ability to use z-ordering and dynamic file pruning to significantly reduce the amount of data that is retrieved from blob storage and therefore drastically improve query times, sometimes by an order of magnitude.
Next CERN Accelerator Logging Service with Jakub Wozniak – Spark Summit
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service, and the Spark-based Extraction API, where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN-specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Diving into Delta Lake: Unpacking the Transaction Log – Databricks
The transaction log is key to understanding Delta Lake because it is the common thread that runs through many of its most important features, including ACID transactions, scalable metadata handling, time travel, and more. In this session, we’ll explore what the Delta Lake transaction log is, how it works at the file level, and how it offers an elegant solution to the problem of multiple concurrent reads and writes.
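To make the file-level view concrete, here is a small Python sketch that peeks at the log directly; each commit is a numbered JSON file of actions under _delta_log/, and the path below is a placeholder:

import json
from pathlib import Path

log_dir = Path("/tmp/events_delta/_delta_log")
for commit_file in sorted(log_dir.glob("*.json")):
    print(f"--- commit {commit_file.stem} ---")
    for line in commit_file.read_text().splitlines():
        action = json.loads(line)      # e.g. {"add": ...}, {"remove": ...}, {"commitInfo": ...}
        print(list(action.keys())[0])  # which kind of action this commit recorded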
Slides from Riak TS talk by Rob Genova at Seattle Scalability meetup Feb 24, 2016
Look out for Riak TS 1.3 next quarter and the uncorkd sample code repo (a play on Untappd)
http://www.meetup.com/Seattle-Scalability-Meetup/events/225955122/
Jump Start on Apache Spark 2.2 with Databricks – Anyscale
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands-On Labs
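As a small taste of the unified APIs listed in the agenda above, a PySpark sketch; Datasets are JVM-only, so from Python the same query is expressed with DataFrames and SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jump-start").getOrCreate()

people = spark.createDataFrame([("melissa", 30), ("don", 35)], ["name", "age"])

# The same query through the DataFrame API...
people.filter(people.age > 32).select("name").show()

# ...and through Spark SQL on a temporary view.
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 32").show()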
Containerized Stream Engine to Build Modern Delta Lake – Databricks
As days go by, everything is changing: your business, your analytics platform, and your data. Deriving real-time insights from this humongous volume of data is key to survival, and a robust solution lets you operate at the speed of change.
Operating and Supporting Delta Lake in Production – Databricks
Delta Lake is widely adopted, and there are things to be aware of when dealing with petabytes of data in Delta Lake. Smart decisions can give the best efficiency and increase the adoption of Delta. Best practices like OPTIMIZE and ZORDER have to be chosen wisely. We have support stories where we successfully resolved performance issues by applying the right performance strategy. There is a set of common issues and repeated questions that our strategic customers face when using Delta, and in this session we cover them and how to address them.
Spark on Hadoop is highly scalable. Cloud computing is highly scalable. R, the extensible open source data science software, is not really. But what happens when we combine Spark on Hadoop, cloud computing, and Microsoft R Server into a scalable data science platform? Imagine being able to explore, transform, and model data of any size from your favorite R environment. Now imagine deploying the resulting models, with a few clicks, as a scalable, cloud-based web services API. In this session, Sascha Dittmann shows how you can use your R code, thousands of open source R packages, and the distributed implementations of the most popular machine learning algorithms to do exactly that. He demonstrates how to create an HDInsight Spark cluster including a Microsoft R Server cluster, and how to deploy the resulting model to SQL Server or as a Swagger-based API for application developers.
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... – Databricks
This talk is about sharing experience and lessons learned from setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming, and database services. The talk is aimed at developers, DBAs, service managers, and members of the Spark community who are using and/or investigating “Big Data” solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or on the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related “Big Data” tools for a community of developers and DBAs at CERN with a background in relational database operations.
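For a flavour of that RDBMS-plus-files integration, a hedged PySpark sketch; the JDBC URL, table names, credentials, and paths are placeholders:

# `spark` is an ordinary SparkSession, as created in the earlier sketches.
orders = (spark.read.format("jdbc")
          .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
          .option("dbtable", "accelerator.orders")
          .option("user", "reader")
          .option("password", "secret")
          .load())

measurements = spark.read.parquet("/data/measurements")

# Join the relational source with the file-based data and write the result back out.
(orders.join(measurements, "device_id")
       .write.mode("overwrite")
       .parquet("/data/orders_enriched"))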
This is an introduction to relational and non-relational databases and how their performance affects scaling a web application.
This is a recording of a guest lecture I gave at the University of Texas School of Information.
In this talk I address the technologies and tools Gowalla (gowalla.com) uses, including memcache, redis, and cassandra.
Find more on my blog:
http://schneems.com
An overview of various database technologies and their underlying mechanisms over time.
Presentation delivered internally at Alliander to inspire the use of and foster interest in new (NoSQL) technologies. 18 September 2012
Slides from my talk at ACCU2011 in Oxford on 16th April 2011. A whirlwind tour of the non-relational database families, with a little more detail on Redis, MongoDB, Neo4j and HBase.
NoSQL is not a buzzword anymore. The array of non-relational technologies has found wide-scale adoption even in non-Internet-scale focus areas. With the advent of the Cloud, the churn has increased even more, yet there is no crystal-clear guidance on adoption techniques and architectural choices surrounding the plethora of options available. This session initiates you into the whys and wherefores, architectural patterns, caveats, and techniques that will augment your decision-making process and boost your perception of architecting scalable, fault-tolerant, and distributed solutions.
Jay Kreps on Project Voldemort: Scaling Simple Storage At LinkedIn – LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn. This was a presentation made at QCon 2009 and is embedded on LinkedIn's blog - http://blog.linkedin.com/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... – James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He has around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... – DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Key Trends Shaping the Future of Infrastructure.pdf – Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Accelerate your Kubernetes clusters with Varnish Caching – Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
DevOps and Testing slides at DASA Connect – Kari Kakkonen
Slides by me and Rik Marselis from the DASA Connect conference on 30 May 2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We finished with a lovely workshop in which participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
PHP Frameworks: I want to break free (IPC Berlin 2024) – Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Search and Society: Reimagining Information Access for Radical Futures – Bhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
UiPath Test Automation using UiPath Test Suite series, part 3 – DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
JMeter webinar - integration with InfluxDB and Grafana – RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
When stars align: studies in data quality, knowledge graphs, and machine lear...
Big Data (NJ SQL Server User Group)
1. Introduction to Big Data and NoSQL
NJ SQL Server User Group
May 15, 2012
Melissa Demsak – SQL Architect, Realogy – www.sqldiva.com
Don Demsak – Advisory Solutions Architect, EMC Consulting – www.donxml.com
5. How did we get here?
• Expensive
o Processors
o Disk space
o Memory
o Operating Systems
o Software
o Programmers
• Culture of Limitations
o Limit CPU cycles
o Limit disk space
o Limit memory
o Limited OS Development
o Limited Software
o Programmers
• One language
• One persistence store
6. Typical RDBMS Implementations
• Fixed table schemas
• Small but frequent reads/writes
• Large batch transactions
• Focus on ACID
o Atomicity
o Consistency
o Isolation
o Durability
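A tiny Python illustration of the atomicity property above, using the standard library's sqlite3 module with made-up table and values: either both updates commit, or neither does.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure the transfer disappears as a unit

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())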
11. 4th Step – Move to the cloud?
[Diagram: three tenants, Customer #1, #2 and #3, each with a Browser calling a Web Tier and a B/L Tier backed by a SQL Azure Federation]
12. Problems created by too much data
• Where to store
• How to store
• How to process
• Organization, searching, and metadata
• How to manage access
• How to copy, move, and backup
• Lifecycle
16. • Atlanta 2009 - No:sql(east) conference
select fun, profit from real_world where relational=false
• Billed as “conference of no-rel datastores”
(loose) Definition
• (often) Open source
• Non-relational
• Distributed
• (often) does not guarantee ACID
18. 5 Groups of Data Models
Relational
Document
Key Value
Graph
Column Family
19. Document?
• Think of a web page...
o Relational model requires column/tag
o Lots of empty columns
o Wasted space and processing time
• Document model just stores the pages as is
o Saves on space
o Very flexible
• Document Databases
o Apache Jackrabbit
o CouchDB
o MongoDB
o SimpleDB
o XML Databases
• MarkLogic Server
• eXist.
20. Key/Value Stores
• Simple Index on Key
• Value can be any serialized form of data
• Lots of different implementations
o Eventually Consistent
• “If no updates occur for a period, eventually all updates will propagate
through the system and all replicas will be consistent”
o Cached in RAM
o Cached on disk
o Distributed Hash Tables
• Examples
o Azure AppFabric Cache
o Memcache-d
o VMWare vFabric GemFire
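A minimal Python sketch of the key-value model just described: the store sees only an opaque serialized value, so changing one field means rewriting the whole value (the in-memory dict stands in for memcached or a distributed hash table):

import json

store = {}  # stand-in for memcached / a distributed hash table

def put(key, value):
    store[key] = json.dumps(value).encode()  # the value is opaque bytes to the store

def get(key):
    return json.loads(store[key])

put("user:42", {"name": "Melissa", "followers": 10})

# Updating one field is a read-modify-write of the entire value.
profile = get("user:42")
profile["followers"] += 1
put("user:42", profile)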
21. Graph?
• Graph consists of
o Node ('stations' of the graph)
o Edges (lines between them)
• Graph Stores
o AllegroGraph
o Core Data
o Neo4j
o DEX
o FlockDB
• Created by the Twitter folks
• Nodes = Users
• Edges = Nature of relationship between nodes.
o Microsoft Trinity (research project)
• http://research.microsoft.com/en-us/projects/trinity/
22. Column Family?
• Lots of variants
o Object Stores
• Db4o
• GemStone/S
• InterSystems Caché
• Objectivity/DB
• ZODB
o Tabular
• BigTable
• Mnesia
• Hbase
• Hypertable
• Azure Table Storage
o Column-oriented
• Greenplum
• Microsoft SQL Server 2012
23. Okay got it, Now Let's Compare Some Real World Scenarios
24. You Need Constant Consistency
• You're dealing with financial transactions
• You're dealing with medical records
• You're dealing with bonded goods
• Best you use an RDBMS
25. You Need Horizontal Scalability
• You're working across defined timezones
• You're aggregating large quantities of data
• Maintaining a chat server (Facebook chat)
• Use Column Family Storage.
26. Frequently Written Rarely Read
• Think web counters and the like
• Every time a user comes to a page = ctr++
• But it's only read when the report is run
• Use Key-Value Storage.
27. Here Today Gone Tomorrow
• Transient data like...
o Web Sessions
o Locks
o Short Term Stats
• Shopping cart contents
• Use Key-Value Storage
28. Where to store
• RAM
o Fast
o Volatile
• Local Disk
o SSD – super fast, expensive
o Fast spinning disks (7200+)
o High Bandwidth possible
o Persistent
• SAN
o Storage Area Network
o Fully managed
o Expensive
• Parallel File System
o HDFS (Hadoop)
o Auto-replicated for parallel decentralized I/O
• Cloud
o Amazon
o Box.Net
o DropBox
30. Big Data Definition
• Volume – beyond what traditional environments can handle
• Velocity – need decisions fast
• Variety – many formats
31. Additional Big Data Concepts
• Volumes & volumes of data
• Unstructured
• Semi-structured
• Not suited for Relational Databases
• Often utilizes MapReduce frameworks
33. Real World Example
• Twitter
o The challenges
• Needs to store many graphs
Who you are following
Who's following you
Who you receive phone notifications from, etc.
• To deliver a tweet requires rapid paging of followers
• Heavy write load as followers are added and removed
• Set arithmetic for @mentions (intersection of users).
34. What did they try?
• Started with Relational Databases
• Tried Key-Value storage of denormalized lists
• Did it work?
o Nope
• Either good at handling the write load
Or paging large amounts of data
But not both
35. What did they need?
• Simplest possible thing that would work
• Allow for horizontal partitioning
• Allow write operations to arrive out of order
o Or be processed more than once
o Failures should result in redundant work
• Not lost work!
36. The Result was FlockDB
• Stores graph data
• Not optimized for graph traversal operations
• Optimized for large adjacency lists
o List of all edges in a graph
• Key is the edge value a set of the node end points
• Optimized for fast read and write
• Optimized for page-able set arithmetic.
37. How Does it Work?
• Stores graphs as sets of edges between nodes
• Data is partitioned by node
o All queries can be answered by a single partition
• Write operations are idempotent
o Can be applied multiple times without changing the result
• And commutative
o Changing the order of operands doesn't change the result.
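A toy Python sketch of that idea, not FlockDB's actual implementation: if every edge write carries a timestamp and the latest state wins, replaying or reordering operations converges to the same adjacency list.

graph = {}  # (node, follower) -> (timestamp, state) within one partition

def apply_write(node, follower, state, ts):
    key = (node, follower)
    if key not in graph or ts >= graph[key][0]:
        graph[key] = (ts, state)  # last writer wins; re-applying is a no-op

ops = [(1, 2, "added", 100), (1, 3, "added", 101), (1, 2, "removed", 102)]

# Apply the operations twice, the second time in reverse order: same outcome.
for op in ops + list(reversed(ops)):
    apply_write(*op)

followers = {f for (n, f), (_, s) in graph.items() if n == 1 and s == "added"}
print(followers)  # {3}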
39. ACID
• Atomicity
o All or Nothing
• Consistency
o Valid according to all defined rules
• Isolation
o No transaction should be able to interfere with another transaction
• Durability
o Once a transaction has been committed, it will remain so, even in
the event of power loss, crashes, or errors
40. BASE
• Basically Available
o High availability but not always consistent
• Soft state
o Background cleanup mechanism
• Eventual consistency
o Given a sufficiently long period of time over which no changes are
sent, all updates can be expected to propagate eventually through
the system and all the replicas will be consistent.
41. Traditional (relational) Approach
[Diagram: Extract from the Transactional Data Store, Transform, and Load into the Data Warehouse (ETL)]
42. Big Data Approach
• MapReduce Pattern/Framework
o an Input Reader
o Map Function – to transform to a common shape (format)
o a Partition Function
o a Compare Function
o Reduce Function
o an Output Writer
43. MongoDB Example
> // map function
> m = function(){
...   this.tags.forEach(
...     function(z){
...       emit( z , { count : 1 } );
...     }
...   );
...};
> // reduce function
> r = function( key , values ){
...   var total = 0;
...   for ( var i=0; i<values.length; i++ )
...     total += values[i].count;
...   return { count : total };
...};
> // execute
> res = db.things.mapReduce(m, r, { out : "myoutput" } );
44. What is Hadoop?
• A scalable fault-tolerant grid operating system for
data storage and processing
• Its scalability comes from the marriage of:
o HDFS: Self-Healing High-Bandwidth Clustered Storage
o MapReduce: Fault-Tolerant Distributed Processing
• Operates on unstructured and structured data
• A large and active ecosystem (many developers
and additions like HBase, Hive, Pig, …)
• Open source under the friendly Apache License
• http://wiki.apache.org/hadoop/
45. Hadoop Design Axioms
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
46. Hadoop Core Components
• Store: HDFS – self-healing, high-bandwidth clustered storage
• Process: Map/Reduce – fault-tolerant distributed processing
47. HDFS: Hadoop Distributed File System
• Block Size = 64MB
• Replication Factor = 3
• Cost/GB is a few ¢/month vs $/month
51. HADOOP [Azure and Enterprise]
[Diagram: a seamless ocean of information processing and analytics – programming layers (Java OM, Streaming OM, HiveQL, PigLatin, .NET/C#/F#, (T)SQL) over an ocean of NoSQL data (unstructured, semi-structured, structured) stored in HDFS, fed via ETL from EIS/RDBMS, File System [RSS], OData, and Azure Storage sources]
There are at least four groups of data models: key-value, document, column-family, and graph. Looking at this list, there's a big similarity between the first three: all have a fundamental unit of storage which is a rich structure of closely related data. For key-value stores it's the value, for document stores it's the document, and for column-family stores it's the column family. In DDD terms, this group of data is an aggregate. A graph database stores data structured in the nodes and relationships of a graph. Column family (BigTable-style) databases are an evolution of key-value, using "families" to allow grouping of rows. The rise of NoSQL databases has been driven primarily by the desire to store data effectively on large clusters, such as the setups used by Google and Amazon. Relational databases were not designed with clusters in mind, which is why people have cast around for an alternative. Storing aggregates as fundamental units makes a lot of sense for running on a cluster. Aggregates make natural units for distribution strategies such as sharding, since you have a large clump of data that you expect to be accessed together.

The Relational Model
The relational model provides for the storage of records that are made up of tuples. Records are stored in tables. Tables are defined by a schema, which determines what columns are in the table. Columns have a name and a type. All records within a table fit that table's definition. SQL is a query language designed to operate over tables. SQL provides syntax for finding records that meet criteria, as well as for relating records in one table to another via joins; a join finds a record in one table based on its relationship to a record in another table. Records can be created (inserted) or deleted. Fields within a record can be updated individually. Implementations of the relational model usually provide transactions, which provide a means to make modifications spanning multiple records atomically. In terms of what programming languages provide, tables are like arrays or lists of records or structures. For high performance access, tables can be indexed in various ways using b-trees or hash maps.

Key-Value Stores
Key-value stores provide access to a value based on a key. The key-value pair can be created (inserted) or deleted. The value associated with a key may be updated. Key-value stores don't usually provide transactions. In terms of what programming languages provide, key-value stores resemble hash tables; these have many names: HashMap (Java), hash (Perl), dict (Python), associative array (PHP), boost::unordered_map<...> (C++). Key-value stores provide one implicit index on the key itself. A key-value store may not sound like the most useful thing, but a lot of information can be stored in the value. It is quite common for the value to be an XML document, a JSON object, or some other serialized form. The key point here is that the storage engine is not aware of the internal structure of the value. It is up to the client application to interpret the value and manage its contents. The value can only be written as a whole; if the client is storing a JSON object, and only wants to update one field, the entire value must be fetched, the new value substituted, and then the entire value must be written back. The inability to fetch data by anything other than one key may appear limited, but there are workarounds. If the application requires a secondary index, the application can maintain one itself. To do this, the application manages a second collection of key-value pairs where the key is the value of another field in the first collection, and the value is the primary key in the first collection. Because there are no transactions that can be used to make sure that the secondary index is kept synchronized with the original collection, any application that does this would be wise to have a periodic syncing process to clean up after any partial changes that occur due to application crashes, bugs, or errors.

Document Stores
Document stores provide access to structured data, but unlike the relational model, there may not be a schema that is enforced. In essence, the application stores bags of key-value pairs. In order to operate in this environment, the application adopts some conventions about how to deal with differing bags it may retrieve, or it may take advantage of the storage engine's ability to put different documents in different collections, which the application will use to manage its data. Unlike a relational store, document stores usually support nested structures. For example, for document stores that support XML or JSON documents, the value of a field may be something that looks like another document. Document stores can also support array or list-valued keys. Unlike a key-value store, document stores are aware of the internal structure of the document. This allows the storage engine to support secondary indexes directly, allowing for efficient queries on any field. The ability to support nested document storage leads to query languages that can be used to search for items nested inside others; XQuery is one example of this. MongoDB supports some similar functionality by allowing the specification of JSON field paths in queries.

Column Stores
Column stores are like relational stores, except that they flip the data around. Instead of storing records, column stores store all the values for a column together in a stream. An index provides a means to get column values for any particular record. Map-reduce implementations such as Hadoop are most efficient if they can stream in their data. Column stores work particularly well for that. As a result, stores like HBase and Hypertable are often used as non-relational data warehouses to feed map-reduce for analytics. A relational-style column scalar may not be the most useful for analytics, so users often store more complex structures in columns. This manifests directly in Cassandra, which introduces the notion of "column families," which get treated as a "super-column." Column-oriented stores support retrieving records, but this requires fetching the column values from their individual columns and re-assembling the record.

Graph Databases
Graph databases store vertices and the edges between them. Some support adding annotations to the vertices and/or edges. This can be used to model things like social graphs (people are represented by vertices, and their relationships are the edges), or real-world objects (components are represented by vertices, and their connectedness is represented by edges). The content on IMDB is tied together by a graph: movies are related to the actors in them, and actors are related to the movies they star in, forming a large complex graph. The access and query languages for graph databases are the most different of the set of those discussed here. Graph database query languages are generally about finding paths in the graph based on either endpoints, or constraints on attributes of the paths between endpoints; one example is SPARQL.
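A small Python sketch of the client-maintained secondary index described above, with made-up field names: a second key-value collection maps an email back to the primary key, and because the two writes are not atomic, a periodic sweep is needed to repair a stale index.

import json

users = {}        # primary collection: user_id -> serialized document
email_index = {}  # secondary index: email -> user_id

def put_user(user_id, doc):
    users[user_id] = json.dumps(doc)
    email_index[doc["email"]] = user_id  # no transaction ties these two writes together

def find_by_email(email):
    user_id = email_index.get(email)
    return json.loads(users[user_id]) if user_id else None

def rebuild_index():
    # periodic sync: recompute the index from the primary collection
    email_index.clear()
    for user_id, raw in users.items():
        email_index[json.loads(raw)["email"]] = user_id

put_user("42", {"name": "Don", "email": "don@example.com"})
print(find_by_email("don@example.com"))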
Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes. A typical Hadoop node is eight cores with 16GB RAM and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.