Big Data Platforms: An Overview
1. Big Data Platforms: An Overview
C. Scyphers, Chief Technical Architect, Daemon Consulting, LLC
2. What Is “Big Data”?
• Big Data is not simply a huge pile of information
• A good starting place is the following thought:
“Big Data describes datasets so large they become very difficult to manage with traditional database tools.”
3. What Is A Big Data Platform?
Putting it simply, it is any platform which supports those kinds of large datasets.
15. NoSQL Does Not Mean “SQL Is Bad”
When the trend was just starting, “NoSQL” was coined. It’s unfortunate, because it implies antagonism towards SQL.
16. NoSQL Means “Not Only SQL”
[Diagram: relational and non-relational stores shown as overlapping sets]
NoSQL is a complement to a traditional RDBMS, not necessarily a replacement for it.
26. NoSQL is based upon a better understanding of the trade-offs in data storage, usually referred to as the “CAP Theorem”
27. The CAP Theorem
Grossly simplified (with apologies to Brewer), a database can be:
• Consistent (all clients see the same data)
• Available (all clients can find some available node)
• Partition-Tolerant (the database will continue to function even if split into disconnected sets – e.g. by a network disruption)
Pick any two.
28. CAP In Practice
• Consistent & Available (no Partition Tolerance)
• Either single machines or single-site clusters
• Typically uses two-phase commits
29. CAP In Practice
• Consistent & Partition-Tolerant (no Availability)
• Some data may be inaccessible, but the remainder is available and consistent
• Sharding is an example of this implementation
[Diagram: customer records split into shards A–F, G–R, and S–Z]
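The range-sharded customer example on this slide can be sketched in a few lines of Python. This is an illustrative toy, not code from the deck; the shard names and the routing function are hypothetical.

```python
# Hypothetical sketch of the slide's range sharding: customer records
# split across three shards by last-name initial.
SHARD_RANGES = {
    "shard-1": ("A", "F"),   # Customers A-F
    "shard-2": ("G", "R"),   # Customers G-R
    "shard-3": ("S", "Z"),   # Customers S-Z
}

def route(last_name: str) -> str:
    """Pick the shard whose letter range covers the customer's initial."""
    initial = last_name[0].upper()
    for shard, (lo, hi) in SHARD_RANGES.items():
        if lo <= initial <= hi:
            return shard
    raise KeyError(f"no shard covers initial {initial!r}")

# If shard-2 becomes unreachable (a partition), customers G-R are
# inaccessible, but shard-1 and shard-3 remain available and consistent.
shard = route("Smith")
```

This is exactly the C-P trade: each shard stays consistent, but a lost shard means some keys simply cannot be served.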
30. CAP In Practice
• Available & Partition-Tolerant (no Consistency)
• Some data may be inaccurate; a conflict resolution strategy is required
• DNS is an example of this, as well as standard master-slave replication
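One common conflict resolution strategy for A-P systems is last-write-wins, which can be sketched as follows. The function name, timestamps, and values are illustrative assumptions, not part of the deck.

```python
# Illustrative last-write-wins conflict resolution, one common strategy
# when replicas diverge during a partition. Names/data are hypothetical.
def resolve(replica_a, replica_b):
    """Each replica holds a (timestamp, value) pair; keep the newer write."""
    return replica_a if replica_a[0] >= replica_b[0] else replica_b

# Two replicas accepted different writes while partitioned:
a = (1700000001, "100 Century Dr.")
b = (1700000005, "16 Kozyak Street")
winner = resolve(a, b)   # once the partition heals, the later write wins
```

Last-write-wins is simple but silently discards the losing write; other systems instead surface both versions to the client, as Riak does later in this deck.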
31. CAP From A Vendor POV
• C-A (no P) – this is generally how most RDBMS vendors operate
• C-P (no A) – this is how many RDBMSs attempt to address scale without incurring large costs
• A-P (no C) – this is how most NoSQL approaches solve the problem
32. ACID vs BASE
Traditional databases are ACID compliant; NoSQL databases tend to be BASE compliant.
ACID:
• Atomicity – either the entire transaction completes or none of it does
• Consistent – any transaction will take the database from one consistent state to another, with no broken constraints
• Isolation – changes do not affect other users until committed
• Durability – committed transactions can be recovered in case of system failure
BASE:
• Basically Available
• Scalable
• Eventually consistent
Eventually consistent is the key phrase here.
42. NoSQL Pros/Cons
Pros:
• Schema evolution
• Horizontal scalability
• Simple protocols
Cons:
• Querying the data is much harder
• Paradigm shift
• Security is a big issue
• May or may not support data types (BLOBs, spatial)
• Generally, uniqueness cannot be enforced
43. A Disclaimer Before We Continue
• I am not an expert on every possible Big Data platform
• There are hundreds of them; these are the ones I consider the leaders in the field and recommend
• If you have a favorite, please let me know and I’ll update the deck for next time
• The internal details on how these systems work are rather complex; I would prefer to take those questions offline
44. Flavors Of NoSQL
The four major divisions of NoSQL are:
• Key-Value
• Document Store
• Columnar
• Other
45. Key-Value
• At a very high level, key-value works essentially by pairing an index token (a key) with a data element (a value).
• Both the index token and the data value can be of any structure.
• Such a pairing is arbitrary and up to the developer of the system to determine.
46. A Key-Value Example
“John Smith”, “100 Century Dr. Alexandria VA 22304”
“John Doe”, “16 Kozyak Street, Lozenets District, 1408 Sofia Bulgaria”
In both examples, the key is a name and the value is an address. However, the structure of the address differs between the two.
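The slide's example can be made concrete with a minimal in-memory store. This is a didactic sketch of the key-value idea, not any particular product's API.

```python
# Minimal in-memory key-value store sketch: arbitrary keys mapped to
# opaque values, as in the slide's two address examples.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value          # the value is opaque to the store

    def get(self, key, default=None):
        return self._data.get(key, default)

store = KeyValueStore()
store.put("John Smith", "100 Century Dr. Alexandria VA 22304")
store.put("John Doe", "16 Kozyak Street, Lozenets District, 1408 Sofia Bulgaria")
# The store neither knows nor cares that the two address formats differ;
# interpreting the value is entirely the application's job.
```

That opacity is both the strength (schema evolution, simple protocols) and the weakness (no querying inside the value) listed in the pro/con slides.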
47. Document Store
• Document stores extend the key-value paradigm into values with multiple attributes.
• The document values tend to be semi-structured data (XML, JSON, et al.) but can also be Word or PDF documents.
48. A Document Store Example
“John Smith”,
<address>
  <street>100 Century Dr.</street>
  <city>Alexandria</city>
  <state>VA</state>
  <postalCode>22304</postalCode>
</address>
“John Doe”,
{
  "address": {
    "street": "16 Kozyak Street",
    "district": "Lozenets, 1408",
    "city": "Sofia",
    "country": "Bulgaria"
  }
}
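What separates a document store from plain key-value is that the engine can see inside the value. A sketch of that idea, using the JSON document above; the `lookup` helper and its dotted-path syntax are illustrative inventions, not a real product's API.

```python
import json

# Document-store sketch: the value is semi-structured JSON the engine
# can reach into, unlike an opaque key-value blob.
doc = json.loads("""
{
  "address": {
    "street": "16 Kozyak Street",
    "district": "Lozenets, 1408",
    "city": "Sofia",
    "country": "Bulgaria"
  }
}
""")

def lookup(document, path):
    """Walk a dotted path like 'address.city' through nested attributes."""
    for part in path.split("."):
        document = document[part]
    return document

city = lookup(doc, "address.city")   # a document store can index this field
```

Being able to address individual attributes is what makes the "multiple indexes" pro on the later document-store slide possible.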
49. Columnar Family
• Usually has “rows” and “columns” – or, at least, their logical equivalents
• Not a traditional, “pure” column store
• More of a hybridized approach leveraging key-value pairs
• A key with many values attached
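The "key with many values attached" shape can be sketched as nested maps: row key, then column family, then column. The row keys, families, and data below are hypothetical, but the layout mirrors the hybrid row/column model this slide (and the BigTable slide later) describes.

```python
# Sketch of a column-family layout: a row key maps to column families,
# each holding many (column, value) pairs -- "a key with many values".
rows = {
    "john.smith": {
        "address": {"street": "100 Century Dr.", "city": "Alexandria"},
        "contact": {"email": "jsmith@example.com"},
    },
    "john.doe": {
        "address": {"city": "Sofia", "country": "Bulgaria"},
    },
}

def get(row_key, family, column):
    """Read one cell; note that rows need not share the same columns."""
    return rows[row_key][family].get(column)

# Reading a narrow slice of columns across many rows is the cheap case;
# touching every column of every row is where columnar stores struggle.
cities = [get(r, "address", "city") for r in rows]
```

Sparse rows come for free: `john.doe` has no `street` cell, and no storage is wasted representing its absence.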
52. Key-Value Pro/Con
Pros:
• Schema evolution
• Horizontal scalability
• Simple protocols
• Works well for volatile data
• High throughput, typically optimized for reads or writes
• Keys become meaningful rather than arbitrary
• Application logic defines object model
Cons:
• Packing & unpacking each key
• Keys typically are not related to each other
• The entire value must be returned, not just a part of it
• Security tends to be an issue
• Hard to support reporting, analytics, aggregation or ordered values
• Generally does not support updates in place
• Application logic defines object model
53. Where Did Key-Value Come From?
The concept is quite old, but most people trace the lineage back to Amazon and the Dynamo paper.
54. Dynamo
Amazon devised the Dynamo engine as a way to address their scalability issues in a reliable way.
• Communication between nodes is peer-to-peer (P2P)
• Replication occurs with the end client addressing conflict resolution
• Quorum reads/writes
• Always writable (hinted handoff)
• Eventually consistent
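The "quorum reads/writes" bullet rests on a simple arithmetic rule that can be written down directly. The concrete replica counts below are illustrative examples, not figures from the deck.

```python
# Sketch of the quorum rule used by Dynamo-style systems: with N
# replicas, a write acknowledged by W nodes and a read consulting R
# nodes must share at least one replica whenever R + W > N, so reads
# observe the latest acknowledged write.
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return r + w > n

N = 3
strict = quorums_overlap(N, r=2, w=2)   # overlapping: reads stay fresh
loose = quorums_overlap(N, r=1, w=1)    # faster, but reads can be stale
```

Tuning R and W per operation is exactly the "consistency level tunable" feature the Riak slide advertises below.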
55. Eventually Consistent
• Rather than expending the runtime resources to ensure that all nodes are aware of a change before continuing, Dynamo uses an eventually consistent model.
• In this model, a subset of nodes are changed.
• Those nodes then inform their neighbors until all nodes are changed (grossly simplifying).
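The neighbor-informs-neighbor propagation can be simulated in a few lines. This is a toy model under simplifying assumptions (one gossip message per updated node per round, fixed random seed); it is not how Dynamo is actually implemented.

```python
import random

# Toy simulation of eventual consistency: a write lands on one node,
# then each round every updated node gossips to one random node until
# all nodes have the change. Grossly simplified, like the slide.
def rounds_to_converge(num_nodes: int, seed: int = 42) -> int:
    rng = random.Random(seed)
    updated = {0}                    # the write initially hits node 0
    rounds = 0
    while len(updated) < num_nodes and rounds < 1000:
        for node in list(updated):
            updated.add(rng.randrange(num_nodes))  # tell a random peer
        rounds += 1
    return rounds

# Convergence takes a handful of rounds rather than happening at write
# time; clients reading in between may see stale data.
rounds = rounds_to_converge(16)
```

The gap between the write completing and the last node learning of it is the window in which "eventual" consistency differs from the ACID kind.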
56. Can I Use Dynamo?
No. It’s an Amazon-only internal product. However, AWS S3 is largely based upon it.
Amazon did announce a DynamoDB offering for their AWS customers. While it’s probably the same, I cannot guarantee that it is.
57. Riak
• Riak is a key-value database largely modeled after the Dynamo model.
• Open source (free) with paid support from Basho.
• Main claims to fame:
• Extreme reliability
• Performance speed
58. Riak Pro/Con
Pros:
• All nodes are equal – no single point of failure
• Horizontal scalability
• Full-text search
• RESTful interface (and HTTP)
• Consistency level tunable on each operation
• Secondary indexes available
• Map/Reduce (JavaScript & Erlang only)
Cons:
• Not meant for small, discrete and numerous datapoints
• Getting data in is great; getting it out, not so much
• Security is non-existent: “Riak assumes the internal environment is trusted”
• Conflict resolution can bubble up to the client if not careful
• Erlang is fast, but it’s got a serious learning curve
60. Redis
• Redis is a key-value in-memory datastore.
• Open source (free) with support from the community.
• Main claims to fame:
• Fast. So very, very fast.
• Transactional support
• Best for rapidly changing data
61. Redis Pro/Con
Pros:
• Transactional support
• Blob storage
• Support for sets, lists and sorted sets
• Support for publish-subscribe (pub-sub) messaging
• Robust set of operators
Cons:
• Entirely in memory
• Master-slave replication (instead of master-master)
• Security is non-existent: designed to be used in trusted environments
• Does not support encryption
• Support can be hard to find
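The sorted-set structure mentioned in the pros is what sets Redis apart from a plain key-value store: members carry a numeric score and stay ordered by it. A plain-Python emulation of the idea (the real redis-py client talks to a running server, so this sketch uses a dict instead; the leaderboard data is made up):

```python
# Emulation of a Redis-style sorted set: members ordered by a numeric
# score. Illustrative only -- real Redis ZADD/ZRANGE run server-side.
leaderboard = {}

def zadd(member_scores):
    """Insert or update members with their scores (member -> score)."""
    leaderboard.update(member_scores)

def zrange_by_rank(start, stop):
    """Return members from rank `start` to `stop`, lowest score first."""
    ordered = sorted(leaderboard, key=leaderboard.get)
    return ordered[start:stop + 1]

zadd({"alice": 300, "bob": 150, "carol": 225})
lowest_two = zrange_by_rank(0, 1)    # the two members with lowest scores
```

Keeping the ordering inside the store is what makes Redis a fit for the "rapidly changing data" use case on the previous slide, such as live leaderboards.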
63. Voldemort
• Voldemort is a key-value in-memory database built by LinkedIn.
• Open source (free) with support from the community.
• Main claims to fame:
• Low latency
• Highly available
• Very fast reads
64. Voldemort Pro/Con
Pros:
• Highly customizable – each layer of the stack can be replaced as needed
• Data elements are versioned during changes
• All nodes are independent – no single point of failure
• Very, very fast reads
Cons:
• Versioning means lots of disk space being used
• Does not support range queries
• No complex query filters
• All joins must be done in code
• No foreign key constraints
• No triggers
• Support can be hard to find
67. Document Store Recap
Document stores store an index token with a grouping of attributes in a semi-structured document.
68. Document Store Pro/Con
Pros:
• Tends to support a more complex data model than key/value
• Good at content management
• Usually supports multiple indexes
• Schemaless (can be nested)
• Typically low-latency reads
• Application logic defines object model
Cons:
• The entire value must be returned, not just a part of it
• Security tends to be an issue
• Joins are not available within the database
• No foreign keys
• Application logic defines object model
69. CouchDB
• CouchDB is a document store database.
• Open source (free), part of the Apache foundation, with paid support available from several vendors.
• Main claims to fame:
• Simple and easy to use
• Good read consistency
• Master-master replication
70. CouchDB Pro/Con
Pros:
• Very simple API for development
• MVCC support for read consistency
• Full Map/Reduce support
• Data is versioned
• Secondary indexes supported
• Some security support
• RESTful API, JSON support
• Materialized views with incremental update support
Cons:
• The simple API for development is somewhat limited
• No foreign keys
• Conflict resolution devolves to the application
• Versioning requires extensive disk space
• Versioning places large load on I/O channels
• Replication for performance, not availability
72. MongoDB
• MongoDB is a document store database.
• Open source (free) with paid support available from 10Gen.
• Main claims to fame:
• Index anything
• Ad hoc query support
• SQL-like operations (not SQL syntax)
73. MongoDB Pro/Con
Pros:
• Auto-sharding
• Auto-failover
• Update in place
• Spatial index support
• Ad hoc query support
• Any field in Mongo can be indexed
• Very, very popular (lots of production deployments)
• Very easy transition from SQL
Cons:
• Does not support JSON: BSON instead
• Master-slave replication
• Has had some growing pains (e.g. Foursquare outage)
• Not RESTful by default
• Failures require a manual database repair operation (similar to MySQL)
• Replication for availability, not performance
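"SQL like operations, not SQL syntax" means queries are expressed as documents themselves, with operators such as `$gt`. A toy matcher showing the flavor of a Mongo-style ad hoc query — this is an illustration, not the pymongo driver:

```python
# Toy Mongo-style query matcher: a filter document is compared against
# each record, supporting exact match and one operator ($gt) as a sample.
def matches(doc, query):
    for field, cond in query.items():
        if isinstance(cond, dict):        # operator form, e.g. {"$gt": 30}
            if "$gt" in cond and not doc.get(field, float("-inf")) > cond["$gt"]:
                return False
        elif doc.get(field) != cond:      # exact-match form
            return False
    return True

users = [{"name": "ann", "age": 31},
         {"name": "bob", "age": 25},
         {"name": "cal", "age": 40}]

# Roughly: SELECT name FROM users WHERE age > 30
found = [u["name"] for u in users if matches(u, {"age": {"$gt": 30}})]
print(found)   # ['ann', 'cal']
```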
76. Columnar Family Recap
• A key with many values attached
• Usually presenting as “rows” and “columns”
• Or, at least, their logical equivalents
77. Columnar Pro/Con
Pros:
• Tend to have some level of rudimentary security support
• Usually include a degree of versioning
• Can be more efficient than row databases when processing a limited number of columns over a large amount of rows
Cons:
• Much less efficient when processing many columns simultaneously
• Joins tend to not be supported
• Referential integrity not available
78. Where Did Columnar Come From?
The concept has been around for a while, but most
people trace the NoSQL lineage back to Google.
79. BigTable
Google devised the BigTable engine as a way to
address their search related scalability issues in a
reliable way.
• Data is organized through a set of keys:
• Row • Column • Timestamp
• A hybrid row/column store with a single master
• Versioning is handled through the time key
• Tablets are a dynamic partition of a sequence of
rows – supports very efficient range scans
• Columns can be grouped into column families
• Column families can have access control
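The BigTable model above — values addressed by row, column, and timestamp, with rows kept in sorted order so a tablet can serve efficient range scans — can be sketched in a few lines. This is an illustration of the data model, not Google's implementation:

```python
# Sketch of the BigTable data model: (row, column) -> versioned cell,
# where versioning is handled through the time key and rows sort lexically.
table = {}   # (row, column) -> list of (timestamp, value), newest first

def put(row, column, timestamp, value):
    cell = table.setdefault((row, column), [])
    cell.append((timestamp, value))
    cell.sort(reverse=True)                  # newest version first

def get(row, column):
    return table[(row, column)][0][1]        # latest version wins

def range_scan(start_row, end_row):
    # Rows are stored sorted, so a tablet can answer this efficiently.
    return [k for k in sorted(table) if start_row <= k[0] < end_row]

put("com.example/a", "anchor:text", 1, "old")
put("com.example/a", "anchor:text", 2, "new")   # versioned via the time key
put("com.example/b", "anchor:text", 1, "other")
print(get("com.example/a", "anchor:text"))       # 'new'
print(range_scan("com.example/a", "com.example/b"))
```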
80. Can I Use BigTable?
No. It’s a Google only internal product. However,
quite a few open source products are built upon
the concepts.
81. Cassandra
• Cassandra is a hybrid: the BigTable data model built on Dynamo infrastructure
• Open source (free), built by Facebook with paid
support available from several vendors.
• Main claims to fame:
• An Apache project
• Very, very fast writes
• Spans multiple datacenters
82. Cassandra Pro/Con
Pros:
• Designed to span multiple datacenters
• Peer to peer communication between nodes
• No single point of failure
• Always writeable
• Consistency level is tunable at run time
• Supports secondary indexes
• Supports Map/Reduce
• Supports range queries
Cons:
• No joins
• No referential integrity
• Written in Java – quite complex to administer and configure
• Last update wins
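"Consistency level is tunable at run time" boils down to quorum arithmetic: with replication factor N, a read touching R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > N. A minimal sketch of that rule:

```python
# Cassandra-style tunable consistency: R + W > N guarantees a read
# overlaps the latest acknowledged write; otherwise reads are eventual.
def is_strongly_consistent(n, r, w):
    return r + w > n

N = 3  # replication factor
print(is_strongly_consistent(N, r=1, w=1))   # False: ONE/ONE, fast but eventual
print(is_strongly_consistent(N, r=2, w=2))   # True: QUORUM reads and writes
print(is_strongly_consistent(N, r=1, w=3))   # True: write ALL, read ONE
```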
84. HBase
• HBase is a columnar database built on top of the
Hadoop environment.
• Open source (free) with paid support from
numerous vendors
• Main claims to fame:
• Ad hoc type abilities
• Easy integration with
Map/Reduce
85. HBase Pro/Con
Pros:
• Map/Reduce support
• More of a CA approach than an AP
• Supports predicate push down for performance gains
• Automatic partitioning and rebalancing of regions
• Data is stored in a sorted order (not indexed)
• RESTful API
• Strong and vibrant ecosystem
Cons:
• Secondary indexes generally not supported
• Security is non-existent
• Requires a Hadoop infrastructure to function
87. Hadoop
• Hadoop is not a columnar store as such.
• Rather, Hadoop is a massively parallel data
processing engine
• Main claims to fame:
• Specializes in unstructured data
• Very flexible and popular
88. Hadoop Pro/Con
Pros:
• While written in Java, almost any language can leverage Hadoop
• Runs on commodity servers
• Horizontally scalable
• Very fast and powerful
• Where Map/Reduce originated
• Ample support from vendors
• “Helper” languages like Hive and Pig
• Strong and vibrant ecosystem
Cons:
• Large amounts of disk space and bandwidth required
• Paradigm shift for IT staff
• Quality talent is highly in demand and expensive
• Security is non-existent
• Name node is a single point of failure
• More or less only supporting batch processing
• Not user friendly to anyone other than developers
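The canonical Map/Reduce example Hadoop popularized is word count. A single-process sketch of the two phases — on a real cluster, map and reduce tasks run in parallel across many nodes:

```python
from collections import defaultdict

# Word count as map + reduce. The shuffle step (grouping pairs by key)
# is folded into the reduce phase here for brevity.
def map_phase(lines):
    for line in lines:                 # mapper: emit (word, 1) per word
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    totals = defaultdict(int)          # reducer: sum counts per word
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data big deal", "data wins"]
print(reduce_phase(map_phase(lines)))  # {'big': 2, 'data': 2, 'deal': 1, 'wins': 1}
```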
90. Columnar “Big Vendor”
• EMC Greenplum
• Teradata Aster
Insofar as both of these solutions graft Map/Reduce into a (more or less) SQL environment
91. Which One Do I Use Where?
• Key-Value for (relatively) simple, volatile data
• Document store for more complex data
• Columnar for analytical processing
• RDBMS for traditional processing – particularly where lazy consistency is not acceptable
• Point Of Sale, for example
93. @scyphers
Additional Information At
http://www.daemonconsulting.net/BDC-FOSE-2012
Daemon Consulting, LLC
http://www.daemonconsulting.net/
Specializing In The Hard Stuff