This document summarizes a webinar presented by Leon Guzenda of Objectivity, Inc. on choosing the right big data tools. It discusses how current big data analytics use separate technologies that are not well-suited for relationship analytics. A polyglot approach is recommended using the appropriate technologies like object databases and graph databases to efficiently store, manage and query relationships in complex data. Objectivity provides the InfiniteGraph massively scalable graph database and Objectivity/DB object database for managing interconnected data and hidden relationships.
Trusted Advisory on Technology Comparison: Exadata, HANA, DB2 - Ajay Kumar Uppal
- SAP HANA is a column-oriented, in-memory database that promises performance gains of up to 100,000x over traditional databases and enables new real-time use cases. Its appliance model reduces costs by simplifying infrastructure requirements. However, it requires new hardware with very large main memory, and it initially has limitations around high availability, disaster recovery, and virtualization.
- Oracle Exadata is an optimized hardware and software appliance for Oracle Database that scales to hundreds of terabytes. It provides fast performance through SSD caching and compression but does not have a true column-oriented architecture. Additional products like TimesTen and Essbase are needed for optimal OLAP support.
- IBM DB2 with the BLU Acceleration extension provides query acceleration for OLAP workloads.
Hadoop World 2011: Unlocking the Value of Big Data with Oracle - Jean-Pierre ... - Cloudera, Inc.
Analyzing new and diverse digital data streams can reveal new sources of economic value, provide fresh insights into customer behavior and identify market trends early on. But this influx of new data can create challenges for IT departments. To derive real business value from Big Data, you need the right tools to capture and organize a wide variety of data types from different sources, and to be able to easily analyze it within the context of all your enterprise data. Attend this session to learn how Oracle’s end-to-end value chain for Big Data can help you unlock the value of Big Data.
If you are seeking ways to improve your cloud database environment with EDB Postgres, this presentation reviews how you can create a Database-as-a-Service (DBaaS) with EDB Postgres on AWS.
This presentation outlines how EDB Ark can play a key role in your digital transformation with more agility and speed.
It highlights:
● How EDB Ark can integrate with your existing AWS environment and other clouds
● How you can automate your database deployments to instantly spin up new databases
● How to manage your database environment more easily using the same GUI for all clouds
● How to boost developer efficiency and satisfaction
Whether your database is currently in the cloud or you are considering the cloud as an option, this presentation will provide you with the information you need to evaluate EDB Postgres and EDB Ark.
The recording of this presentation includes a demonstration. Visit www.edbpostgres.com > resources > webcasts
The Real Scoop on Migrating from Oracle Databases - EDB
During this presentation you will be provided with actionable guidelines to:
• Identify the right applications to migrate
• Easily and safely migrate your applications
• Leverage resources before, during and after your migration
• Learn how to achieve independence from Oracle databases - without sacrificing performance.
Operationalizing Data Science Using Cloud Foundry - VMware Tanzu
The document discusses how operationalizing machine learning models through continuous deployment and monitoring is important for realizing business value but often overlooked. It describes how Alpine Data's Chorus platform, in combination with Pivotal's Big Data Suite and Cloud Foundry, can provide a turn-key solution for operationalizing models by deploying scalable scoring engines that consume models exported in the PFA format. The platform aims to make it simple to deploy both individual models and complex scoring flows represented as PFA documents, ensuring models have maximum impact on the business.
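The description centers on PFA-style portable model documents, so here is a minimal sketch of the idea: a model serialized as a declarative JSON document that any scoring engine can interpret. The field names and the `linear` op below are invented for illustration; the real PFA specification is far richer.

```python
import json

# Toy "model document" inspired by PFA's idea of a portable, declarative
# scoring spec. These field names are illustrative, NOT the real PFA schema.
model_doc = json.dumps({
    "input": "double",
    "output": "double",
    "action": {"op": "linear", "coef": 2.5, "intercept": 1.0},
})

def score(document: str, x: float) -> float:
    """Interpret a model document against one input record."""
    doc = json.loads(document)
    action = doc["action"]
    if action["op"] == "linear":
        return action["coef"] * x + action["intercept"]
    raise ValueError(f"unsupported op: {action['op']}")

print(score(model_doc, 4.0))  # 11.0
```

Because the model travels as data rather than code, the same document can be scored by any engine that understands the format, which is the portability argument made above.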
Postgres Integrates Effectively in the "Enterprise Sandbox" - EDB
This presentation provides guidance through these challenges and offers solutions that allow you to:
- Connect to multiple sources of data to support your growing business
- Integrate with existing incumbent systems that power your business
- Share siloed data among your technical teams to address strategic objectives
- Learn how customers integrated EDB Postgres within their corporate ecosystems that included Oracle, SQL Server, MongoDB, Hadoop, MySQL and Tuxedo
This presentation covers the solutions, services, and best practice recommendations you need to be a leader in today’s complex digital environment.
Target Audience: The content will interest both business and technical decision-makers or influencers responsible for the overall strategy and execution of a PostgreSQL and/or an EDB Postgres database.
Ashnik EnterpriseDB PostgreSQL - A real alternative to Oracle - Ashnikbiz
A technical introduction to PostgreSQL and Postgres Plus, the enterprise-class PostgreSQL database from EDB: a 'real' alternative to Oracle and other conventional proprietary databases.
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat... - Qian Lin
This document summarizes a survey of advanced non-relational database systems, their approaches, applications, and comparison to relational database management systems (RDBMS). It outlines the problem of scaling to meet new web-scale demands and describes how non-relational databases provide a solution by sacrificing consistency for availability and partition tolerance. Examples of non-relational databases are provided, including their data models, APIs, optimizations, and benefits compared to RDBMS such as improved scalability and fault tolerance.
Temporal Tables, Transparent Archiving in DB2 for z/OS and IDAA - Cuneyt Goksu
The document discusses several data archiving solutions for z/OS systems including temporal tables, transparent archiving, and IDAA technology. Temporal tables allow querying and updating historical data using system time periods. Transparent archiving moves old data to other storage platforms while still allowing dynamic queries. IDAA provides accelerated query performance for temporal tables by routing queries to an accelerator system. The solutions can be combined for different use cases depending on data retention and access needs.
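The system-time mechanism behind temporal tables can be sketched in a few lines: each update closes the current row version and opens a new one, so any past state can be queried "as of" a timestamp. This is a toy concept illustration, not DB2 syntax; all names are invented.

```python
from datetime import datetime

# Minimal sketch of system-time versioning, the idea behind DB2 temporal
# tables: rows carry a validity period (sys_start, sys_end) instead of
# being overwritten, so history remains queryable.
history = []  # rows: (key, value, sys_start, sys_end); end=None means current

def update(key, value, now):
    for i, (k, v, start, end) in enumerate(history):
        if k == key and end is None:
            history[i] = (k, v, start, now)   # close the current version
    history.append((key, value, now, None))   # open a new version

def as_of(key, ts):
    """Return the value that was current at timestamp ts."""
    for k, v, start, end in history:
        if k == key and start <= ts and (end is None or ts < end):
            return v
    return None

update("acct", 100, datetime(2023, 1, 1))
update("acct", 250, datetime(2023, 6, 1))
print(as_of("acct", datetime(2023, 3, 1)))  # 100
print(as_of("acct", datetime(2023, 7, 1)))  # 250
```

In DB2 the same idea is expressed declaratively (a SYSTEM_TIME period plus "FOR SYSTEM_TIME AS OF" queries), with the engine maintaining the history table automatically.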
This presentation reviews the key methodologies that all members of the team should consider, such as:
- How to prioritize the right application or project for your first Oracle migration
- Tips to execute a well-defined, phased migration process to minimize risk and increase time to value
- Handling the common concerns and pitfalls related to a migration project
- What resources you can leverage before, during and after your migration
- Suggestions on how you can achieve independence from an Oracle database – without sacrificing performance.
Target audience: This presentation is intended for IT Decision-Makers and Leaders on the team involved in Database decisions and execution.
For more information, please email sales@enterprisedb.com
Relational databases vs Non-relational databases - James Serra
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
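The consistency-versus-availability trade-off behind the CAP theorem mentioned above can be made concrete with a toy two-replica store: during a network partition, a CP-style read refuses to answer rather than risk staleness, while an AP-style read serves whatever a reachable replica has. All names here are illustrative, not any real database's API.

```python
# Toy two-replica store illustrating the CAP trade-off.
class Replica:
    def __init__(self):
        self.value = None
        self.reachable = True

replicas = [Replica(), Replica()]

def write(value):
    for r in replicas:
        if r.reachable:
            r.value = value   # unreachable replicas miss the update

def read_cp():
    # Consistent-under-partition: refuse to answer unless all replicas agree.
    if not all(r.reachable for r in replicas):
        raise RuntimeError("partition: refusing read to stay consistent")
    return replicas[0].value

def read_ap():
    # Available-under-partition: answer from any reachable replica.
    for r in replicas:
        if r.reachable:
            return r.value

write("v1")
replicas[1].reachable = False   # network partition begins
write("v2")                     # only replica 0 sees this
print(read_ap())                # "v2": available despite the partition
try:
    read_cp()
except RuntimeError:
    print("CP read refused during partition")
```

Relational systems generally sit on the consistency side of this trade, while many NoSQL stores choose availability, which is why the CAP theorem keeps coming up in these comparisons.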
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop - Hazelcast
In this webinar, the speaker identifies several shortcomings of Apache Hadoop and presents an alternative approach for building simple and flexible Big Data software stacks quickly, based on next-generation computing paradigms such as in-memory data/compute grids. The focus of the talk is on software architectures, but several code examples using Hazelcast illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
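The in-memory data grid idea at the heart of this talk can be sketched as a hash-partitioned map spread across nodes: each entry lives on the node that owns its key's hash slot, so both data and the compute sent to it scale out. This single-process toy is only a concept illustration, not the Hazelcast API.

```python
# Concept sketch of an in-memory data grid's distributed map.
class GridNode:
    def __init__(self):
        self.store = {}   # this node's partition of the data, held in RAM

class DistributedMap:
    def __init__(self, nodes):
        self.nodes = nodes

    def _owner(self, key):
        # Hash-partitioning: the key's hash decides which node owns it.
        return self.nodes[hash(key) % len(self.nodes)]

    def put(self, key, value):
        self._owner(key).store[key] = value

    def get(self, key):
        return self._owner(key).store.get(key)

nodes = [GridNode() for _ in range(3)]
dmap = DistributedMap(nodes)
for i in range(100):
    dmap.put(f"key-{i}", i)

print(dmap.get("key-42"))                  # 42
print(sum(len(n.store) for n in nodes))    # 100 entries spread over 3 nodes
```

A real grid adds replication, partition rebalancing when nodes join or leave, and the ability to ship computation to the node holding the data, which is what makes the approach attractive for "Fast Data" workloads.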
EDBT 2013 - Near Realtime Analytics with IBM DB2 Analytics Accelerator - Daniel Martin
The document discusses IBM's DB2 Analytics Accelerator (IDAA) which uses incremental updates to synchronize data between DB2 and the IDAA appliance in near real-time. It describes the architecture of using log-based capture and propagation to minimize latency. The user interface allows controlling replication at the subsystem and table level. High availability is ensured through failover capabilities. Tuning options and evaluation of query impact are also covered.
According to Gartner, organizations can reduce their database spend by up to 80% by deploying EDB Postgres in place of traditional database solutions like Oracle. Nevertheless, the perceived risks associated with migrating from Oracle to an open source-based alternative prevent many organizations from trying.
Review this presentation to learn some of EDB Postgres Enterprise’s more important features and techniques employed to reduce migration risk.
This presentation will be valuable to organizations researching Postgres, as well as current Oracle customers considering migrating to an open source-based database management system such as EDB Postgres. It highlights key points for both business and technical decision-makers and influencers.
Evaluating scenarios: running Postgres without commercial assistance, working with a consulting partner, and EDB Postgres Standard
Are you asking yourself which Postgres solution will give you what you need?
If you are unsure which Postgres is right for you, rest assured that others have faced the same challenge. Based on our work with other Postgres users, we have developed a guide to the pros and cons of various Postgres solutions so that you can make an educated decision.
This presentation introduces the usage profiles and the technical and business risks involved in running PostgreSQL without commercial support, developing a fork, working with a consulting partner, and subscribing to EDB Postgres.
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527 - Zohar Elkayam
Big data is one of the biggest buzzwords in today's market. Terms such as Hadoop, HDFS, YARN, Sqoop, and non-structured data have been scaring DBAs since 2010, but where does the DBA team really fit in?
In this session, we will discuss everything database administrators and database developers need to know about big data. We will demystify the Hadoop ecosystem and explore the different components. We will learn how HDFS and MapReduce are changing the data world and where traditional databases fit into the grand scheme of things. We will also talk about why DBAs are the perfect candidates to transition into big data and Hadoop professionals and experts.
This is the presentation I gave in Kscope17, on June 27, 2017.
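The MapReduce model this session demystifies can be illustrated with the classic word count, compressed into a single process: map emits (key, value) pairs, a shuffle groups them by key, and reduce aggregates each group. In Hadoop the three phases run distributed across a cluster; the logic is the same.

```python
from collections import defaultdict
from itertools import groupby

def map_phase(doc):
    # Map: emit one (word, 1) pair per occurrence.
    for word in doc.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: bring all values for the same key together.
    pairs = sorted(pairs)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

def reduce_phase(key, values):
    # Reduce: aggregate the grouped values per key.
    return key, sum(values)

docs = ["big data big ideas", "data beats hype"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs))
print(counts["big"], counts["data"])  # 2 2
```

The reason this decomposition matters is that map and reduce are independently parallelizable, so the framework can spread both phases across HDFS data nodes without the programmer writing any distribution logic.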
During this presentation, Craig Silviera, WW Director of Field Engineering, described the steps to take to dramatically reduce your IT infrastructure costs when you make the switch from Oracle.
Craig provided actionable steps to:
• Identify the right applications to migrate
• Easily and safely migrate your applications
• Leverage resources before, during and after your migration
To learn more about migrating your database from Oracle to Postgres,
please email info@enterprisedb.com and someone will follow up with you asap.
Active/Active Database Solutions with Log Based Replication in xDB 6.0 - EDB
EDB’s xDB Replication Server is a highly flexible database replication tool that provides single and multi-master solutions for read/write scalability, availability, performance, and data integration with Oracle, SQL Server and Postgres. Dozens of worldwide customers have been using xDB Replication Server for the past 4 years, and we are extremely excited to introduce a pivotal new release, version 6.0.
This presentation reviews the features in xDB 6.0 including:
* Faster and more efficient replication with log-based Multi Master replication for Postgres Plus and PostgreSQL
* Easier bulk configuration of publication tables with pattern-matching selection rules
* High availability ensured through integration of the 'Control Schema'
* Improved performance in conflict detection rules
An Expert Guide to Migrating Legacy Databases to PostgreSQL - EDB
This webinar will review the challenges teams face when migrating from Oracle databases to PostgreSQL. We will share insights gained from running large-scale Oracle compatibility assessments over the last two years, including the more than 2,200,000 Oracle DDL constructs assessed through EDB’s Migration Portal in 2020.
During this session we will address:
Storage definitions
Packages
Stored procedures
PL/SQL code
Proprietary database APIs
Large scale data migrations
We will end the session demonstrating migration tools that significantly simplify and aid in reducing the risk of migrating Oracle databases to PostgreSQL.
Overview of EnterpriseDB Postgres Plus Advanced Server 9.4 and Postgres Enter... - EDB
The presentation will provide you with a full overview of the new features and key benefits of EnterpriseDB's Postgres Plus Advanced Server 9.4 and Postgres Enterprise Manager 5.0.
Kudu is an open source storage layer developed by Cloudera that provides low latency queries on large datasets. It uses a columnar storage format for fast scans and an embedded B-tree index for fast random access. Kudu tables are partitioned into tablets that are distributed and replicated across a cluster. The Raft consensus algorithm ensures consistency during replication. Kudu is suitable for applications requiring real-time analytics on streaming data and time-series queries across large datasets.
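The columnar layout that makes Kudu's scans fast can be contrasted with a row layout in a toy example: an aggregate over one column reads a single contiguous list instead of picking a field out of every full row. This is a concept sketch of why column stores favor analytics, not Kudu's actual storage code.

```python
# Row layout: each record is a complete dict; an aggregate over "amount"
# still has to walk every full row.
rows = [{"id": i, "region": "us", "amount": i * 1.5} for i in range(1000)]
total_row = sum(r["amount"] for r in rows)

# Columnar layout: the same data stored one list per column, so the scan
# touches only the "amount" column and skips "id" and "region" entirely.
columns = {
    "id": [r["id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
total_col = sum(columns["amount"])

print(total_row == total_col)  # True: same answer, far less data touched
```

On disk the effect is larger still: contiguous same-typed values compress well and stream sequentially, which is the property Kudu exploits for fast scans while its per-tablet indexes cover the random-access case.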
Minimize Headaches with Your Postgres Deployment - EDB
Postgres deployments are not difficult to put in place, but both existing and new users are wise to ask general Postgres questions, as well as questions specific to their implementation, to ensure a successful deployment.
This presentation will help you minimize deployment challenges to ensure that your Postgres plans meet and exceed their goals.
This presentation covers the following topics:
- Identify specific key challenges that can hinder your deployment
- Overcome barriers to avoid poor results
- Discover what success can look like for your organization
- Find out how to get started with your deployment with EDB
Target Audience: This presentation is intended for Business and Technical Management overseeing a Postgres deployment team. This presentation is equally suitable for organizations already using community PostgreSQL as well as EDB’s Postgres Plus product family.
Accelerating Business Intelligence Solutions with Microsoft Azure (PASS) - Jason Strate
Business Intelligence (BI) solutions need to move at the speed of business. Unfortunately, roadblocks related to availability of resources and deployment often present an issue. What if you could accelerate the deployment of an entire BI infrastructure to just a couple of hours and start loading data into it by the end of the day? In this session, we'll demonstrate how to leverage Microsoft tools and the Azure cloud environment to build out a BI solution and begin providing analytics to your team with tools such as Power BI. By the end of the session, you'll gain an understanding of the capabilities of Azure and how you can start building an end-to-end BI proof-of-concept today.
This document provides an introduction to relational databases, NoSQL databases, and data in general. It includes the following:
- An overview of relational databases and their ACID properties. Relational databases are best for structured, centralized data and scale vertically.
- A survey of several popular NoSQL databases like MongoDB, Cassandra, Redis, and HBase. NoSQL databases are best for unstructured, large quantities of data and scale horizontally.
- General advice that the data and query models, durability needs, scalability needs, and consistency requirements should determine the best database choice. Trying different options is recommended.
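The ACID properties attributed to relational databases above can be demonstrated with SQLite, which ships in Python's standard library: a transfer that fails midway rolls back completely, so no partial update is ever visible.

```python
import sqlite3

# Atomicity, the "A" in ACID: either the whole transfer happens or none of it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on exception
        conn.execute(
            "UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        raise RuntimeError("crash mid-transfer")  # simulate a failure here
        # the matching credit to bob never runs
except RuntimeError:
    pass

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}: the debit was rolled back
```

Most NoSQL stores relax exactly this kind of multi-row guarantee to gain horizontal scalability, which is why the advice above says consistency requirements should drive the database choice.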
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021 - Sandesh Rao
The document discusses Oracle Machine Learning (OML) services on Oracle Autonomous Database. It provides an overview of the OML services REST API, which allows storing and deploying machine learning models. It enables scoring of models using REST endpoints for application integration. The API supports classification/regression of ONNX models from libraries like Scikit-learn and TensorFlow. It also provides cognitive text capabilities like topic discovery, keywords, sentiment analysis and text summarization.
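Scoring a model over REST, as described above, typically means posting input records as JSON to a deployment endpoint and receiving predictions back. The URL and payload field names below are hypothetical placeholders, not the documented OML REST API; consult the actual API reference before use. The sketch builds the request without sending it, so it runs offline.

```python
import json

def build_scoring_request(deployment: str, records: list) -> tuple:
    """Assemble a hypothetical REST scoring request (illustrative only)."""
    url = f"https://example.oraclecloud.test/score/{deployment}"  # placeholder
    payload = json.dumps({"inputRecords": records})               # assumed field name
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer <token>",  # placeholder credential
    }
    return url, payload, headers

url, payload, headers = build_scoring_request(
    "churn_model", [{"tenure": 12, "monthly_spend": 42.0}])
print(json.loads(payload)["inputRecords"][0]["tenure"])  # 12
# An actual call would then be: requests.post(url, data=payload, headers=headers)
```

The appeal of this pattern is that any application that can issue an HTTP POST can consume the model, which is what makes REST deployment useful for application integration.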
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology - InfiniteGraph
Join Oracle NoSQL DB and InfiniteGraph development teams in a discussion of the latest trends in Big Data and Graph Technology. Learn what Oracle’s view of Big Data is and how Oracle NoSQL Database technologies enable you to manage vast amounts of real-time key-value data.
Join Objectivity, Inc.’s VP of Product Management, Brian Clark, in a discussion of the latest trends in Big Data analytics, defining what Big Data is and understanding how to maximize your existing architectures by utilizing NoSQL technologies to improve functionality and provide real-time results. There will be a focus on relationship analytics, as well as an introduction to NoSQL data stores and object and graph databases, such as the architecture behind Objectivity/DB and InfiniteGraph.
A Survey of Advanced Non-relational Database Systems: Approaches and Applicat...Qian Lin
This document summarizes a survey of advanced non-relational database systems, their approaches, applications, and comparison to relational database management systems (RDBMS). It outlines the problem of scaling to meet new web-scale demands, describes how non-relational databases provide a solution by sacrificing consistency for availability and partition tolerance. Examples of non-relational databases are provided, including their data models, APIs, optimizations, and benefits compared to RDBMS such as improved scalability and fault tolerance.
Temporal Tables, Transparent Archiving in DB2 for z/OS and IDAACuneyt Goksu
The document discusses several data archiving solutions for z/OS systems including temporal tables, transparent archiving, and IDAA technology. Temporal tables allow querying and updating historical data using system time periods. Transparent archiving moves old data to other storage platforms while still allowing dynamic queries. IDAA provides accelerated query performance for temporal tables by routing queries to an accelerator system. The solutions can be combined for different use cases depending on data retention and access needs.
This presentation reviews the key methodologies that all the member of the team should consider such as:
- How to prioritize the right application or project for your first Oracle
- Tips to execute a well-defined, phased migration process to minimize risk and increase time to value
- Handling the common concerns and pitfalls related to a migration project
- What resources you can leverage before, during and after your migration
- Suggestions on how you can achieve independence from an Oracle database – without sacrificing performance.
Target audience: This presentation is intended for IT Decision-Makers and Leaders on the team involved in Database decisions and execution.
For more information, please email sales@enterprisedb.com
Relational databases vs Non-relational databasesJames Serra
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopHazelcast
In this webinar
This talk identifies several shortcomings of Apache Hadoop and presents an alternative approach for building simple and flexible Big Data software stacks quickly, based on next generation computing paradigms, such as in-memory data/compute grids. The focus of the talk is on software architectures, but several code examples using Hazelcast will be provided to illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
EDBT 2013 - Near Realtime Analytics with IBM DB2 Analytics AcceleratorDaniel Martin
The document discusses IBM's DB2 Analytics Accelerator (IDAA) which uses incremental updates to synchronize data between DB2 and the IDAA appliance in near real-time. It describes the architecture of using log-based capture and propagation to minimize latency. The user interface allows controlling replication at the subsystem and table level. High availability is ensured through failover capabilities. Tuning options and evaluation of query impact are also covered.
According to Gartner, organizations can reduce their database spend by up to 80% by deploying EDB Postgres in place of traditional database solutions like Oracle. Nevertheless, the perceived risks associated with migrating from Oracle to an open source-based alternative prevents many organizations from trying.
Review this presentation to learn some of EDB Postgres Enterprise’s more important features and techniques employed to reduce migration risk.
This presentation will be valuable to organizations researching Postgres, as well as current Oracle customers considering migrating to an open source-based database management system such as EDB Postgres. It highlights key points for both business and technical decision-makers and influencers.
Evaluating scenarios without commercial assistance, a consulting partner and EDB Postgres Standard
Are you asking yourself which Postgres solution will give you what you need?
If you are unsure, which Postgres is right for you, rest assured that others have faced the same challenge. Based on our work with other Postgres users we have developed a guide providing you with the pros and cons of various Postgres solutions so that you can make an educated decision.
This presentation introduces the usage profiles and the technical and business risks involved running PostgreSQL without commercial support, developing a fork, working with a consulting partner and subscribing to EDB Postgres.
Things Every Oracle DBA Needs to Know About the Hadoop Ecosystem 20170527Zohar Elkayam
Big data is one of the biggest buzzwords in today's market. Terms such as Hadoop, HDFS, YARN, Sqoop, and non-structured data have been scaring DBAs since 2010, but where does the DBA team really fit in?
In this session, we will discuss everything database administrators and database developers need to know about big data. We will demystify the Hadoop ecosystem and explore the different components. We will learn how HDFS and MapReduce are changing the data world and where traditional databases fit into the grand scheme of things. We will also talk about why DBAs are the perfect candidates to transition into big data and Hadoop professionals and experts.
This is the presentation I gave in Kscope17, on June 27, 2017.
During this presentation, Craig Silviera, WW Director of Field Engineering described the steps to take to dramatically reduce your IT infrastructure costs when you make the switch from Oracle.
Craig provided actionable steps to:
• Identify the right applications to migrate
• Easily and safely migrate your applications
• Leverage resources before, during and after your migration
To learn more about migrating your database from Oracle to Postgres,
please email info@enterprisedb.com and someone will follow up with you asap.
Active/Active Database Solutions with Log Based Replication in xDB 6.0EDB
EDB’s xDB Replication Server is a highly flexible database replication tool that provides single and multi-master solutions for read/write scalability, availability, performance, and data integration with Oracle, SQL Server and Postgres. Dozens of worldwide customers have been using xDB Replication Server for the past 4 years, and we are extremely excited to introduce a pivotal new release, version 6.0.
This presentation reviews the features in xDB 6.0 including:
* Faster and more efficient replication with log-based Multi Master replication for Postgres Plus and PostgreSQL
* Easier to configure publication tables in bulk with pattern matching selection rules
* Ensure High Availability with integration of the 'Control Schema'
* Improved performance in conflict detection rules
An Expert Guide to Migrating Legacy Databases to PostgreSQLEDB
his webinar will review the challenges teams face when migrating from Oracle databases to PostgreSQL. We will share insights gained from running large scale Oracle compatibility assessments over the last two years, including the over 2,200,000 Oracle DDL constructs that were assessed through EDB’s Migration Portal in 2020.
During this session we will address:
Storage definitions
Packages
Stored procedures
PL/SQL code
Proprietary database APIs
Large scale data migrations
We will end the session demonstrating migration tools that significantly simplify and aid in reducing the risk of migrating Oracle databases to PostgreSQL.
Overview of EnterpriseDB Postgres Plus Advanced Server 9.4 and Postgres Enter...EDB
The presentation will provide you with a full overview of the new features and key benefits of EnterpriseDB's Postgres Plus Advanced Server 9.4 and Postgres Enterprise Manager 5.0.
Kudu is an open source storage layer developed by Cloudera that provides low latency queries on large datasets. It uses a columnar storage format for fast scans and an embedded B-tree index for fast random access. Kudu tables are partitioned into tablets that are distributed and replicated across a cluster. The Raft consensus algorithm ensures consistency during replication. Kudu is suitable for applications requiring real-time analytics on streaming data and time-series queries across large datasets.
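The partitioning and replication scheme described above can be illustrated with a small Python sketch. This is not Kudu's code; `tablet_for` and `replicas_for` are hypothetical names showing how hash partitioning maps a row's primary key to a tablet and how a fixed replication factor places copies on distinct tablet servers:

```python
import hashlib

NUM_TABLETS = 4
REPLICATION_FACTOR = 3
SERVERS = ["ts-0", "ts-1", "ts-2", "ts-3", "ts-4"]

def tablet_for(primary_key: str) -> int:
    """Hash-partition a row onto one of NUM_TABLETS tablets."""
    digest = hashlib.md5(primary_key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_TABLETS

def replicas_for(tablet: int) -> list:
    """Place REPLICATION_FACTOR copies on distinct tablet servers."""
    return [SERVERS[(tablet + i) % len(SERVERS)] for i in range(REPLICATION_FACTOR)]

t = tablet_for("row-42")
assert 0 <= t < NUM_TABLETS
assert len(set(replicas_for(t))) == REPLICATION_FACTOR
```

In Kudu itself, one replica per tablet acts as the Raft leader and replicates writes to the followers, which is how consistency is maintained during replication.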
Minimize Headaches with Your Postgres Deployment (EDB)
Postgres deployments are not difficult to put in place, but both existing and new users are wise to ask general Postgres questions, as well as questions specific to their implementation, to ensure a successful deployment.
This presentation will help you minimize deployment challenges to ensure that your Postgres plans meet and exceed their goals.
This presentation covers the following topics:
- Identify specific key challenges that can hinder your deployment
- Overcome barriers to avoid poor results
- Discover what success can look like for your organization
- Find out how to get started with your deployment with EDB
Target Audience: This presentation is intended for Business and Technical Management overseeing a Postgres deployment team. This presentation is equally suitable for organizations already using community PostgreSQL as well as EDB’s Postgres Plus product family.
Accelerating Business Intelligence Solutions with Microsoft Azure (Jason Strate)
Business Intelligence (BI) solutions need to move at the speed of business. Unfortunately, roadblocks related to availability of resources and deployment often present an issue. What if you could accelerate the deployment of an entire BI infrastructure to just a couple of hours and start loading data into it by the end of the day? In this session, we'll demonstrate how to leverage Microsoft tools and the Azure cloud environment to build out a BI solution and begin providing analytics to your team with tools such as Power BI. By the end of the session, you'll gain an understanding of the capabilities of Azure and how you can start building an end-to-end BI proof of concept today.
This document provides an introduction to relational databases, NoSQL databases, and data in general. It includes the following:
- An overview of relational databases and their ACID properties. Relational databases are best for structured, centralized data and scale vertically.
- A survey of several popular NoSQL databases like MongoDB, Cassandra, Redis, and HBase. NoSQL databases are best for unstructured, large quantities of data and scale horizontally.
- General advice that the data and query models, durability needs, scalability needs, and consistency requirements should determine the best database choice. Trying different options is recommended.
AutoML - Heralding a New Era of Machine Learning - CASOUG Oct 2021 (Sandesh Rao)
The document discusses Oracle Machine Learning (OML) services on Oracle Autonomous Database. It provides an overview of the OML services REST API, which allows storing and deploying machine learning models. It enables scoring of models using REST endpoints for application integration. The API supports classification/regression of ONNX models from libraries like Scikit-learn and TensorFlow. It also provides cognitive text capabilities like topic discovery, keywords, sentiment analysis and text summarization.
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology (InfiniteGraph)
Join Oracle NoSQL DB and InfiniteGraph development teams in a discussion of the latest trends in Big Data and Graph Technology. Learn what Oracle’s view of Big Data is and how Oracle NoSQL Database technologies enable you to manage vast amounts of real-time key-value data.
Join Objectivity, Inc.’s VP of Product Management, Brian Clark, in a discussion of the latest trends in Big Data Analytics, defining what is Big Data and understanding how to maximize your existing architectures by utilizing NOSQL technologies to improve functionality and provide real-time results. There will be a focus on relationship analytics as well as an introduction to NOSQL data stores, object and graph databases, such as the architecture behind Objectivity/DB and InfiniteGraph.
This document provides an introduction to NoSQL databases. It discusses the history and limitations of relational databases that led to the development of NoSQL databases. The key motivations for NoSQL databases are that they can handle big data, provide better scalability and flexibility than relational databases. The document describes some core NoSQL concepts like the CAP theorem and different types of NoSQL databases like key-value, columnar, document and graph databases. It also outlines some remaining research challenges in the area of NoSQL databases.
The document provides an overview of Big Data technology landscape, specifically focusing on NoSQL databases and Hadoop. It defines NoSQL as a non-relational database used for dealing with big data. It describes four main types of NoSQL databases - key-value stores, document databases, column-oriented databases, and graph databases - and provides examples of databases that fall under each type. It also discusses why NoSQL and Hadoop are useful technologies for storing and processing big data, how they work, and how companies are using them.
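The four NoSQL data models mentioned above can be contrasted with tiny Python data structures. This is a conceptual sketch, not any particular database's API:

```python
# Key-value store: an opaque value addressed by a single key (e.g. Redis).
kv = {"session:42": b"...serialized state..."}

# Document store: nested, schema-free records (e.g. MongoDB).
doc = {"_id": 7, "name": "Ada", "orders": [{"sku": "X1", "qty": 2}]}

# Column-oriented store: values grouped per column, not per row,
# which makes scans over one attribute cheap (e.g. HBase-style families).
columns = {"name": ["Ada", "Alan"], "city": ["London", "Manchester"]}

# Graph store: nodes with properties plus explicit, typed relationships.
nodes = {1: {"name": "Ada"}, 2: {"name": "Alan"}}
edges = [(1, 2, "KNOWS")]
```

The point of the comparison is that each model optimizes a different access pattern: lookup by key, retrieval of a whole document, scans over one attribute, or traversal of relationships.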
This document provides an overview of graph databases and Neo4j. It discusses how graph databases are better suited than relational databases for interconnected data and have simpler data models. Neo4j is highlighted as a graph database that uses nodes, edges and properties to represent data and uses the Cypher query language. It is fully ACID compliant, open source, and has a large active community.
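The node/edge/property model behind graph databases like Neo4j can be sketched in a few lines of Python. This is a conceptual illustration, not Neo4j's API; the breadth-first traversal plays the role a Cypher pattern such as `MATCH (a)-[:KNOWS*]->(b)` would play:

```python
from collections import deque

# A tiny property graph: nodes carry properties, edges carry a type.
nodes = {1: {"name": "Ann"}, 2: {"name": "Bob"}, 3: {"name": "Cid"}}
edges = {1: [(2, "KNOWS")], 2: [(3, "KNOWS")], 3: []}

def reachable(start: int, rel: str) -> set:
    """All nodes reachable from `start` via edges of type `rel` (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for target, edge_type in edges.get(node, []):
            if edge_type == rel and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

assert reachable(1, "KNOWS") == {2, 3}
```

Because edges are stored explicitly, traversal cost is proportional to the neighborhood visited rather than to total data size, which is the structural advantage over joining relational tables.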
This document discusses trends driving the adoption of NoSQL databases, including increasing data size, connectivity of information, semi-structured data, and distributed application architectures. It describes four categories of NoSQL databases - aggregate-oriented, key-value stores, column family (BigTable), and document databases - and provides examples and comparisons of their pros and cons.
Oracle Week 2016 - Modern Data Architecture (Arthur Gimpel)
This document discusses modern operational data architectures and the use of both relational and NoSQL databases. It provides an overview of relational databases and their ACID properties. While relational databases dominate the market, they have limitations around scalability, flexibility, and performance. NoSQL databases offer alternatives like horizontal scaling and flexible schemas. Key-value stores are best for caching, sessions, and serving data, while document stores are popular for hierarchical and search use cases. Graph databases excel at link analysis. The document advocates a polyglot persistence approach using multiple database types according to their strengths. It provides examples of search architectures using both database-centric and application-centric distribution approaches.
Data Lake Acceleration vs. Data Virtualization - What’s the difference? (Denodo)
Watch full webinar here: https://bit.ly/3hgOSwm
Data Lake technologies have been in constant evolution in recent years, with each iteration promising to fix what previous ones failed to accomplish. Several data lake engines are hitting the market with better ingestion, governance, and acceleration capabilities that aim to create the ultimate data repository. But isn't that the promise of a logical architecture with data virtualization too? So, what’s the difference between the two technologies? Are they friends or foes? This session will explore the details.
This document discusses big data and applying semantics to unstructured data. It provides an overview of big data, including what big data is, why it is important now due to cheaper hardware and software, and examples like Hadoop. It also discusses NoSQL databases and semantic tools to help analyze and understand large, unstructured data sets. Key recommendations are to aim high and leverage the big data opportunities, build consensus on opportunities and risks, and focus on developing important human skills like critical thinking to help analyze large data sets.
"Get Ready for Big Data" presentation from Gilbane Boston 2011; for more details, see http://gilbaneboston.com/conference_program.html#t2 and http://pbokelly.blogspot.com/2011/12/gilbane-boston-2011-big-data.html
Big Data is the reality of modern business: companies large and small are trying to find their own benefit in it. Big Data technologies are not meant to replace traditional ones, but to complement them. In this presentation you will hear what Big Data and a Data Lake are, and which technologies are most popular in the Big Data world. We will also speak about Hadoop and Spark, how they integrate with traditional systems, and their benefits.
This document provides an introduction to NoSQL databases, including the motivation behind them, where they fit, types of NoSQL databases like key-value, document, columnar, and graph databases, and an example using MongoDB. NoSQL databases are a new way of thinking about data that is non-relational, schema-less, and can be distributed and fault tolerant. They are motivated by the need to scale out applications and handle big data with flexible and modern data models.
Elliott Cordo, Principal Consultant at Caserta Concepts, delivered a talk on NoSQL data storage architectures at our most recent Big Data Warehousing Meetup: what they are, how they're used and why you can't ignore them in the context of existing enterprise data ecosystems.
For more information, check out our website at http://www.casertaconcepts.com/.
Extract business value by analyzing large volumes of multi-structured data from various sources such as databases, websites, blogs, social media, smart sensors...
Hadoop and the Data Warehouse: When to Use Which (DataWorks Summit)
In recent years, Apache™ Hadoop® has emerged from humble beginnings to disrupt the traditional disciplines of information management. As with all technology innovation, hype is rampant, and data professionals are easily overwhelmed by diverse opinions and confusing messages.
Even seasoned practitioners sometimes miss the point, claiming for example that Hadoop replaces relational databases and is becoming the new data warehouse. It is easy to see where these claims originate since both Hadoop and Teradata® systems run in parallel, scale up to enormous data volumes and have shared-nothing architectures. At a conceptual level, it is easy to think they are interchangeable, but the differences overwhelm the similarities. This session will shed light on the differences and help architects, engineering executives, and data scientists identify when to deploy Hadoop and when it is best to use MPP relational database in a data warehouse, discovery platform, or other workload-specific applications.
Two of the most trusted experts in their fields, Steve Wooledge, VP of Product Marketing from Teradata and Jim Walker of Hortonworks will examine how big data technologies are being used today by practical big data practitioners.
NoSQL is a non-relational database approach that accommodates a wide variety of data models. It is non-relational, distributed, flexible and scalable. The four main types of NoSQL databases are document databases, key-value stores, column-oriented databases, and graph databases. MongoDB is an example of a document-oriented NoSQL database. NoSQL databases offer benefits over relational databases like flexible schemas, horizontal scalability, and fast queries. Hadoop is an open source framework for distributed storage and processing of large datasets across clusters of computers. It uses MapReduce as its parallel programming model and the Hadoop Distributed File System for storage.
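The MapReduce model mentioned above can be illustrated with the classic word-count example in Python. This is a single-process sketch of the programming model, not Hadoop's API; in a real cluster the map outputs would be shuffled across machines before the reduce step:

```python
from collections import defaultdict

def map_phase(doc: str):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in doc.lower().split():
        yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts emitted for each distinct word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data tools", "big graph data"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(pairs)
assert counts["data"] == 2 and counts["big"] == 2
```

Because each map call touches only one document and each reduce key is independent, both phases parallelize naturally across a cluster, which is the core idea behind Hadoop's scalability.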
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
Architecture, Products, and Total Cost of Ownership of the Leading Machine Le... (DATAVERSITY)
Organizations today need a broad set of enterprise data cloud services with key data functionality to modernize applications and utilize machine learning. They need a comprehensive platform designed to address multi-faceted needs by offering multi-function data management and analytics to solve the enterprise’s most pressing data and analytic challenges in a streamlined fashion.
In this research-based session, I’ll discuss what the components are in multiple modern enterprise analytics stacks (i.e., dedicated compute, storage, data integration, streaming, etc.) and focus on total cost of ownership.
A complete machine learning infrastructure cost for the first modern use case at a midsize to large enterprise will be anywhere from $3 million to $22 million. Get this data point as you take the next steps on your journey into the highest spend and return item for most companies in the next several years.
Data at the Speed of Business with Data Mastering and Governance (DATAVERSITY)
Do you ever wonder how data-driven organizations fuel analytics, improve customer experience, and accelerate business productivity? They are successful by governing and mastering data effectively so they can get trusted data to those who need it faster. Efficient data discovery, mastering and democratization is critical for swiftly linking accurate data with business consumers. When business teams can quickly and easily locate, interpret, trust, and apply data assets to support sound business judgment, it takes less time to see value.
Join data mastering and data governance experts from Informatica—plus a real-world organization empowering trusted data for analytics—for a lively panel discussion. You’ll hear more about how a single cloud-native approach can help global businesses in any economy create more value—faster, more reliably, and with more confidence—by making data management and governance easier to implement.
What is data literacy? Which organizations, and which workers in those organizations, need to be data-literate? There are seemingly hundreds of definitions of data literacy, along with almost as many opinions about how to achieve it.
In a broader perspective, companies must consider whether data literacy is an isolated goal or one component of a broader learning strategy to address skill deficits. How does data literacy compare to other types of skills or “literacy” such as business acumen?
This session will position data literacy in the context of other worker skills as a framework for understanding how and where it fits and how to advocate for its importance.
Building a Data Strategy – Practical Steps for Aligning with Business Goals (DATAVERSITY)
Developing a Data Strategy for your organization can seem like a daunting task – but it’s worth the effort. Getting your Data Strategy right can provide significant value, as data drives many of the key initiatives in today’s marketplace – from digital transformation, to marketing, to customer centricity, to population health, and more. This webinar will help demystify Data Strategy and its relationship to Data Architecture and will provide concrete, practical ways to get started.
Uncover how your business can save money and find new revenue streams.
Driving profitability is a top priority for companies globally, especially in uncertain economic times. It's imperative that companies reimagine growth strategies and improve process efficiencies to help cut costs and drive revenue – but how?
By leveraging data-driven strategies layered with artificial intelligence, companies can achieve untapped potential and help their businesses save money and drive profitability.
In this webinar, you'll learn:
- How your company can leverage data and AI to reduce spending and costs
- Ways you can monetize data and AI and uncover new growth strategies
- How different companies have implemented these strategies to achieve cost optimization benefits
Data Catalogs Are the Answer – What Is the Question? (DATAVERSITY)
Organizations with governed metadata made available through their data catalog can answer questions their people have about the organization’s data. These organizations get more value from their data, protect their data better, gain improved ROI from data-centric projects and programs, and have more confidence in their most strategic data.
Join Bob Seiner for this lively webinar where he will talk about the value of a data catalog and how to build the use of the catalog into your stewards’ daily routines. Bob will share how the tool must be positioned for success and viewed as a must-have resource that is a steppingstone and catalyst to governed data across the organization.
In this webinar, Bob will focus on:
-Selecting the appropriate metadata to govern
-The business and technical value of a data catalog
-Building the catalog into people’s routines
-Positioning the data catalog for success
-Questions the data catalog can answer
Because every organization produces and propagates data as part of their day-to-day operations, data trends are becoming more and more important in the mainstream business world’s consciousness. For many organizations in various industries, though, comprehension of this development begins and ends with buzzwords: “Big Data,” “NoSQL,” “Data Scientist,” and so on. Few realize that all solutions to their business problems, regardless of platform or relevant technology, rely to a critical extent on the data model supporting them. As such, data modeling is not an optional task for an organization’s data effort, but rather a vital activity that facilitates the solutions driving your business. Since quality engineering/architecture work products do not happen accidentally, the more your organization depends on automation, the more important the data models driving the engineering and architecture activities of your organization. This webinar illustrates data modeling as a key activity upon which so much technology and business investment depends.
Specific learning objectives include:
- Understanding what types of challenges require data modeling to be part of the solution
- How automation requires standardization, which is derivable via data modeling techniques
- Why only a working partnership between data and the business can produce useful outcomes
Analytics play a critical role in supporting strategic business initiatives. Despite the obvious value to analytic professionals of providing the analytics for these initiatives, many executives question the economic return of analytics as well as data lakes, machine learning, master data management, and the like.
Technology professionals need to calculate and present business value in terms business executives can understand. Unfortunately, most IT professionals lack the knowledge required to develop comprehensive cost-benefit analyses and return on investment (ROI) measurements.
This session provides a framework to help technology professionals research, measure, and present the economic value of a proposed or existing analytics initiative, no matter what form the business benefit takes. The session will provide practical advice on how to calculate ROI, which formulas to use, and how to collect the necessary information.
How a Semantic Layer Makes Data Mesh Work at Scale (DATAVERSITY)
Data Mesh is a trending approach to building a decentralized data architecture by leveraging a domain-oriented, self-service design. However, the pure definition of Data Mesh lacks a center of excellence or central data team and doesn’t address the need for a common approach for sharing data products across teams. The semantic layer is emerging as a key component to supporting a Hub and Spoke style of organizing data teams by introducing data model sharing, collaboration, and distributed ownership controls.
This session will explain how data teams can define common models and definitions with a semantic layer to decentralize analytics product creation using a Hub and Spoke architecture.
Attend this session to learn about:
- The role of a Data Mesh in the modern cloud architecture.
- How a semantic layer can serve as the binding agent to support decentralization.
- How to drive self service with consistency and control.
Enterprise data literacy. A worthy objective? Certainly! A realistic goal? That remains to be seen. As companies consider investing in data literacy education, questions arise about its value and purpose. While the destination – having a data-fluent workforce – is attractive, we wonder how (and if) we can get there.
Kicking off this webinar series, we begin with a panel discussion to explore the landscape of literacy, including expert positions and results from focus groups:
- why it matters,
- what it means,
- what gets in the way,
- who needs it (and how much they need),
- what companies believe it will accomplish.
In this engaging discussion about literacy, we will set the stage for future webinars to answer specific questions and feature successful literacy efforts.
The Data Trifecta – Privacy, Security & Governance Race from Reactivity to Re... (DATAVERSITY)
Change is hard, especially in response to negative stimuli or what is perceived as negative stimuli. So organizations need to reframe how they think about data privacy, security and governance, treating them as value centers to 1) ensure enterprise data can flow where it needs to, 2) prevent – not just react – to internal and external threats, and 3) comply with data privacy and security regulations.
Working together, these roles can accelerate faster access to approved, relevant and higher quality data – and that means more successful use cases, faster speed to insights, and better business outcomes. However, both new information and tools are required to make the shift from defense to offense, reducing data drama while increasing its value.
Join us for this panel discussion with experts in these fields as they discuss:
- Recent research about where data privacy, security and governance stand
- The most valuable enterprise data use cases
- The common obstacles to data value creation
- New approaches to data privacy, security and governance
- Their advice on how to shift from a reactive to resilient mindset/culture/organization
You’ll be educated, entertained and inspired by this panel and their expertise in using the data trifecta to innovate more often, operate more efficiently, and differentiate more strategically.
Emerging Trends in Data Architecture – What’s the Next Big Thing? (DATAVERSITY)
With technological innovation and change occurring at an ever-increasing rate, it’s hard to keep track of what’s hype and what can provide practical value for your organization. Join this webinar to see the results of a recent DATAVERSITY survey on emerging trends in Data Architecture, along with practical commentary and advice from industry expert Donna Burbank.
Data Governance Trends - A Look Backwards and Forwards (DATAVERSITY)
As DATAVERSITY’s RWDG series hurtles into its 12th year, this webinar takes a quick look behind us, evaluates the present, and predicts the future of Data Governance. Based on webinar numbers, hot Data Governance topics have evolved over the years from policies and best practices, roles and tools, and data catalogs and frameworks to supporting data mesh and fabric, artificial intelligence, virtualization, literacy, and metadata governance.
Join Bob Seiner as he reflects on the past and what has and has not worked, while sharing examples of enterprise successes and struggles. In this webinar, Bob will challenge the audience to stay a step ahead by learning from the past and blazing a new trail into the future of Data Governance.
In this webinar, Bob will focus on:
- Data Governance’s past, present, and future
- How trials and tribulations evolve to success
- Leveraging lessons learned to improve productivity
- The great Data Governance tool explosion
- The future of Data Governance
Data Governance Trends and Best Practices To Implement Today (DATAVERSITY)
1) The document discusses best practices for data protection on Google Cloud, including setting data policies, governing access, classifying sensitive data, controlling access, encryption, secure collaboration, and incident response.
2) It provides examples of how to limit access to data and sensitive information, gain visibility into where sensitive data resides, encrypt data with customer-controlled keys, harden workloads, run workloads confidentially, collaborate securely with untrusted parties, and address cloud security incidents.
3) The key recommendations are to protect data at rest and in use through classification, access controls, encryption, confidential computing; securely share data through techniques like secure multi-party computation; and have an incident response plan to quickly address threats.
It is a fascinating, explosive time for enterprise analytics.
It is from the position of analytics leadership that the enterprise mission will be executed and company leadership will emerge. In this information economy, the data professional sits at the center of company performance and has an obligation to demonstrate the possibilities and originate the architecture, data, and projects that will deliver analytics. After all, no matter what business you’re in, you’re in the business of analytics.
The coming years will be full of big changes in enterprise analytics and data architecture. William will kick off the fifth year of the Advanced Analytics series with a discussion of the trends winning organizations should build into their plans, expectations, vision, and awareness now.
Too often I hear the question “Can you help me with our data strategy?” Unfortunately, for most, this is the wrong request because it focuses on the least valuable component: the data strategy itself. A more useful request is: “Can you help me apply data strategically?” Yes, at early maturity phases the process of developing strategic thinking about data is more important than the actual product! Trying to write a good (much less perfect) data strategy on the first attempt is generally not productive – particularly given the widespread acceptance of Mike Tyson’s truism: “Everybody has a plan until they get punched in the face.” This program refocuses efforts on learning how to iteratively improve the way data is strategically applied. This will permit data-based strategy components to keep up with agile, evolving organizational strategies. It also contributes to three primary organizational data goals. Learn how to improve the following:
- Your organization’s data
- The way your people use data
- The way your people use data to achieve your organizational strategy
This will help in ways never imagined. Data are your sole non-depletable, non-degradable, durable strategic assets, and they are pervasively shared across every organizational area. Addressing existing challenges programmatically includes overcoming necessary but insufficient prerequisites and developing a disciplined, repeatable means of improving business objectives. This process (based on the theory of constraints) is where the strategic data work really occurs as organizations identify prioritized areas where better assets, literacy, and support (data strategy components) can help an organization better achieve specific strategic objectives. Then the process becomes lather, rinse, and repeat. Several complementary concepts are also covered, including:
- A cohesive argument for why data strategy is necessary for effective data governance
- An overview of prerequisites for effective strategic use of data strategy, as well as common pitfalls
- A repeatable process for identifying and removing data constraints
- The importance of balancing business operation and innovation
Who Should Own Data Governance – IT or Business? (DATAVERSITY)
The question is asked all the time: “What part of the organization should own your Data Governance program?” The typical answers are “the business” and “IT (information technology).” Another answer to that question is “Yes.” The program must be owned and reside somewhere in the organization. You may ask yourself if there is a correct answer to the question.
Join this new RWDG webinar with Bob Seiner where Bob will answer the question that is the title of this webinar. Determining ownership of Data Governance is a vital first step. Figuring out the appropriate part of the organization to manage the program is an important second step. This webinar will help you address these questions and more.
In this session Bob will share:
- What is meant by “the business” when it comes to owning Data Governance
- Why some people say that Data Governance in IT is destined to fail
- Examples of IT positioned Data Governance success
- Considerations for answering the question in your organization
- The final answer to the question of who should own Data Governance
This document summarizes a research study that assessed the data management practices of 175 organizations between 2000-2006. The study had both descriptive and self-improvement goals, such as understanding the range of practices and determining areas for improvement. Researchers used a structured interview process to evaluate organizations across six data management processes based on a 5-level maturity model. The results provided insights into an organization's practices and a roadmap for enhancing data management.
MLOps – Applying DevOps to Competitive Advantage (DATAVERSITY)
MLOps is a practice for collaboration between Data Science and operations to manage the production machine learning (ML) lifecycles. As an amalgamation of “machine learning” and “operations,” MLOps applies DevOps principles to ML delivery, enabling the delivery of ML-based innovation at scale to result in:
Faster time to market of ML-based solutions
More rapid rate of experimentation, driving innovation
Assurance of quality, trustworthiness, and ethical AI
MLOps is essential for scaling ML. Without it, enterprises risk struggling with costly overhead and stalled progress. Several vendors have emerged with offerings to support MLOps: the major offerings are Microsoft Azure ML and Google Vertex AI. We looked at these offerings from the perspective of enterprise features and time-to-value.
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
HCL Notes and Domino License Cost Reduction in the World of DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Introduction of Cybersecurity with OSS at Code Europe 2024Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Your One-Stop Shop for Python Success: Top 10 US Python Development Providersakankshawande
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
AI 101: An Introduction to the Basics and Impact of Artificial IntelligenceIndexBug
Imagine a world where machines not only perform tasks but also learn, adapt, and make decisions. This is the promise of Artificial Intelligence (AI), a technology that's not just enhancing our lives but revolutionizing entire industries.
OpenID AuthZEN Interop Read Out - AuthorizationDavid Brossard
During Identiverse 2024 and EIC 2024, members of the OpenID AuthZEN WG got together and demoed their authorization endpoints conforming to the AuthZEN API
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Programming Foundation Models with DSPy - Meetup SlidesZilliz
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxSitimaJohn
Ocean Lotus cyber threat actors represent a sophisticated, persistent, and politically motivated group that poses a significant risk to organizations and individuals in the Southeast Asian region. Their continuous evolution and adaptability underscore the need for robust cybersecurity measures and international cooperation to identify and mitigate the threats posed by such advanced persistent threat groups.
5. A Typical “Big Data” Analytics Setup
[Diagram: data aggregation and analytics applications run on commodity Linux platforms and/or high performance computing clusters, over a range of repositories — RDBMS, column store, Hadoop data warehouse, document DB, graph DB, object DB and key-value store — spanning structured, semi-structured and unstructured data.]
7. Not Only SQL – a group of 4 primary technologies
• Users choose between four different primary technologies for different purposes:
– Key-Value Stores
– “Big Table” Clones
– Document Databases
– Object and Graph databases (including InfiniteGraph)
• Many implementations sacrifice consistency (ACID transactions; CAP – eventual consistency) for performance.
• Technologies such as Objectivity/DB and InfiniteGraph offer ACID transactions, with both consistency and performance.
9. Key-Value Stores
“Dynamo: Amazon’s Highly Available Key-Value Store” [2007]
• Data model:
– Global key-value mapping
– Scalable (sharded) HashMap
– Highly fault tolerant (typically)
• Examples:
– Riak, Redis and Voldemort
10. Key-Value Stores: Pros & Cons
• Strengths:
– Simple data model
– Great at scaling out horizontally
– Scalable
– Available
• Weaknesses:
– Simplistic data model
– Poor for complex data
– Unsuited for interconnected data
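To make the “scalable (sharded) HashMap” idea above concrete, here is a toy sketch in Python. The class and method names are invented for illustration; real stores like Riak or Redis Cluster use consistent hashing and replication, which this sketch omits.

```python
# Minimal sketch of a sharded key-value store: one big logical
# HashMap split across N shards by hashing the key.

class ShardedKVStore:
    def __init__(self, num_shards=4):
        self.shards = [{} for _ in range(num_shards)]

    def _shard(self, key):
        # Route each key to one shard. Real systems use consistent
        # hashing so keys move minimally when shards are added/removed.
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key, default=None):
        return self._shard(key).get(key, default)

store = ShardedKVStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))
```

Note how every operation is addressed by key alone — which is exactly why the model scales so well, and why it struggles with interconnected data: there is no native way to follow a relationship from one value to another.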
11. Big Table Clones – Column Family
• Google’s “Bigtable: A Distributed Storage System for Structured Data” [2006]
• Column-family stores are essentially Big Table clones.
• Data model:
– A big table, with column families (KEY → column name, value, date/time)
– Map-Reduce for parallel query/processing
• Examples:
– HBase, HyperTable and Cassandra
12. Big Table Clones – Pros & Cons
• Strengths:
– Data model supports semi-structured data
– Naturally indexed (columns)
– Good at scaling out horizontally
• Weaknesses:
– Complex data model
– Unsuited for highly interconnected data
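The column-family data model described above (row key → column family → column name → value plus timestamp) can be sketched as nested maps. This is an illustrative toy, not any particular product’s API:

```python
# Toy column-family row layout:
#   row key -> column family -> column name -> (value, timestamp)
import time

table = {}

def put(row_key, family, column, value, ts=None):
    # Each cell keeps a timestamp, as in Bigtable-style stores
    # (real clones keep multiple timestamped versions per cell).
    ts = ts if ts is not None else time.time()
    table.setdefault(row_key, {}).setdefault(family, {})[column] = (value, ts)

def get(row_key, family, column):
    value, ts = table[row_key][family][column]
    return value

put("user42", "profile", "name", "Ada")
put("user42", "profile", "city", "London")
print(get("user42", "profile", "name"))
```

Columns can vary from row to row, which is what makes the model a good fit for semi-structured data.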
13. Document Databases
• Data model:
– A collection of unstructured or semi-structured documents
– Each document is referenced using a key-value pair
– The “value” can range from unstructured text to a collection of key-value pairs or a group of XML objects
– Index-centric to support queries based on content
• Examples:
– CouchDB and MongoDB
14. Document Databases – Pros & Cons
• Strengths:
– Simple, powerful data model
– Good scalability if sharding is supported
• Weaknesses:
– Unsuited for interconnected data
– Query model is limited to keys and indexes
– Generally uses Map-Reduce (designed for batch operations) for larger queries
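The “index-centric” point above is the key difference from a plain key-value store: a document database can answer queries about document content, not just document keys. A toy sketch (names invented; real products like MongoDB maintain far richer indexes):

```python
# Toy document store: documents addressed by key, plus a simple
# inverted index so queries can be content-based, not key-based.
from collections import defaultdict

docs = {}
index = defaultdict(set)  # (field, value) -> set of document keys

def insert(key, document):
    docs[key] = document
    for field, value in document.items():
        index[(field, value)].add(key)

def find(field, value):
    # Content query: answered from the index, no full scan needed.
    return [docs[k] for k in index[(field, value)]]

insert("d1", {"type": "invoice", "total": 120})
insert("d2", {"type": "invoice", "total": 80})
insert("d3", {"type": "receipt", "total": 120})
print(find("type", "invoice"))
```

Even here, though, there is nothing that records a relationship *between* two documents — which is the weakness the slide calls out for interconnected data.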
15. Object Databases
• Data model [ODMG ’93]:
– Objects have a Class (type) and a group of Values
– Each object instance has a unique Object Identifier [OID]
– Connections use Object Identifiers for efficiency
– Supports class inheritance and polymorphism
• Examples:
– Objectivity/DB and db4objects
16. Object Databases – Pros & Cons
• Strengths:
– Simple, powerful data model that includes inheritance and polymorphism
– Every object has a class (type) and a unique Object Identifier
– Good scalability if sharding is supported
– Uses Object Identifiers instead of JOIN tables to support very fast navigational operations
• Weaknesses:
– The query language never became a standard
– Supports standard object-oriented languages, but isn’t supported by a wide range of third-party tools in the way that SQL is
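The navigational-access strength above can be illustrated with a toy OID scheme (this is a sketch of the general ODBMS idea, not the Objectivity/DB API): connections store object identifiers directly, so following a relationship is a single lookup rather than a JOIN.

```python
# Sketch of the ODBMS model: every object gets a unique OID, and
# connections are stored as OIDs, so traversal is a direct lookup.
import itertools

_oids = itertools.count(1)
objects = {}  # OID -> object (stands in for the persistent store)

class PersistentObject:
    def __init__(self, **values):
        self.oid = next(_oids)          # unique Object Identifier
        self.values = values
        self.connections = []           # OIDs of related objects
        objects[self.oid] = self

    def connect(self, other):
        self.connections.append(other.oid)

alice = PersistentObject(name="Alice")
acme = PersistentObject(name="Acme Corp")
alice.connect(acme)

# Navigational access: follow the stored OID directly, no JOIN table.
employer = objects[alice.connections[0]]
print(employer.values["name"])
```

Subclassing `PersistentObject` would give the inheritance and polymorphism the slide mentions, since the OID scheme is independent of the object’s concrete class.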
17. Graph Databases
• Data model:
– Node (Vertex) and Relationship (Edge) objects
– Directed
– May be a hypergraph (edges with multiple endpoints)
• Examples:
– InfiniteGraph, Neo4j, OrientDB, AllegroGraph, TitanDB and Dex
18. Graph Databases – Pros & Cons
• Strengths:
– Extremely fast for connected data
– Scales out, typically
– Easy to query (navigation)
– Simple data model
• Weaknesses:
– May not support distribution or sharding
– Requires a conceptual shift... a different way of thinking
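A minimal sketch of the vertex/edge model above, with a query expressed as navigation rather than as a join. Class and label names are invented for this example; it is not the API of any of the products listed.

```python
# Minimal directed property-graph model: Vertex and Edge objects,
# queried by navigating edges outward from a vertex.
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out_edges = []

class Edge:
    def __init__(self, label, source, target):
        # A directed, labeled relationship between two vertices.
        self.label, self.source, self.target = label, source, target
        source.out_edges.append(self)

alice, bob = Vertex("alice"), Vertex("bob")
Edge("follows", alice, bob)

# "Who does alice follow?" — answered by walking her out-edges.
print([e.target.name for e in alice.out_edges if e.label == "follows"])
```

This is the “conceptual shift” the slide refers to: you think in terms of walking relationships, not matching rows.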
20. Typical “Big Data” Analytics Phases
[Diagram: Front-End Processing → Repository → Analytics and Visualization Tools]
The strategic competitors are all moving in the same direction.
21. Incremental Improvements Aren’t Enough
All current solutions use the same basic architectural model.
• None of the current solutions have a way to store connections between entities in different silos.
• Most analytic technology focuses on the content of the data nodes, rather than the many kinds of connections between the nodes and the data in those connections.
• Why? Because relational and most NoSQL solutions are bad at handling relationships.
• Object and graph databases can efficiently store, manage and query the many kinds of relationships hidden in the data.
26. Example 4 – Ad Placement Networks
Smartphone ad placement – based on the user’s profile and location data captured by opt-in applications.
• The location data can be stored and distilled in a key-value and column store hybrid database, such as Cassandra.
• The locations are matched with geospatial data to deduce user interests.
• As ad placement orders arrive, an application built on a graph database, such as InfiniteGraph, matches groups of users with ads:
– Maximizes relevance for the user.
– Yields maximum value for the advertiser and the placer.
27. Example 5 – Healthcare Informatics
• Problem: Physicians need better electronic records for managing patient data on a global basis and matching symptoms, causes, treatments and interdependencies to improve diagnoses and outcomes.
• Solution: Create a database capable of leveraging existing architecture, using NoSQL tools such as Objectivity/DB and InfiniteGraph, that can handle data capture, symptoms, diagnoses, treatments, reactions to medications, interactions and progress.
• Result: It works:
– Diagnosis is faster and more accurate.
– The knowledge base tracks similar medical cases.
– Treatment success rates have improved.
28. Relationship (Connection) Analytics...
[Diagram: a relational database with rows scattered across Table_A through Table_G.]
Think about the SQL query for finding all links between the two “blue” rows... Good luck!
Relational databases aren’t good at handling complex relationships!
29. Relationship (Connection) Analytics...
The same tangle of tables, revisited: with Objectivity/DB or InfiniteGraph, the path between the two “blue” rows (A3 and G4) can be found with a few lines of code.
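To make the “few lines of code” claim concrete, here is a generic breadth-first search over an edge list — a sketch of why path-finding is natural in a graph model, not the Objectivity/DB or InfiniteGraph API. The edge list and node names (A3, G4, etc.) are invented to mirror the slide’s diagram.

```python
from collections import deque

def find_path(edges, start, goal):
    """Breadth-first search for one shortest path between two nodes."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)  # treat links as bidirectional
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection between start and goal

# Rows in different "tables" become plain nodes; links become edges.
edges = [("A3", "B1"), ("B1", "C7"), ("C7", "G4"), ("A3", "D2")]
print(find_path(edges, "A3", "G4"))  # ['A3', 'B1', 'C7', 'G4']
```

The equivalent in SQL would require either a recursive common table expression or a self-join per hop across seven tables — which is precisely the pain the previous slide illustrates.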
32. Lesson 1 – The Repository Matters A Lot

NEED                          RDBMS   Key-Value   Column Family   Document DB   ODBMS   Graph DB
OLTP                          YES     No          Maybe           No            Maybe   No
Text Handling                 No      No          No              YES           Maybe   No
Multimedia                    No      Maybe       No              Maybe         YES     Maybe
Engineering/Scientific        No      No          No              No            YES     Maybe
Business Intelligence         YES     No          Maybe           No            Maybe   Maybe
Log Processing                Maybe   No          Maybe           No            YES     Maybe
Connection Handling/Analysis  No      No          No              No            Maybe   YES
33. Lesson 2 – Languages and Tools Matter Too

NEED                          Repository          Language             BI Tools   Visual Analytics
OLTP                          RDBMS               SQL, Java            YES        Maybe
Text                          Document Database   Java, XML            No         Maybe
Multimedia                    ODBMS               Java, C++            No         Maybe
Eng/Science                   ODBMS               C, C++, R, Fortran   Maybe      YES
Business Intelligence         RDBMS               Java, SQL, R         YES        YES
Log Processing                NoSQL, ODBMS        C++, R, Java, SQL    Maybe      YES
Connection Handling/Analysis  Graph Database      Java, C++, SPARQL    Maybe      YES
34. SUMMARY: A Polyglot Approach Works Best...
[Diagram: the PROBLEM determines the LANGUAGE and REPOSITORY, which feed the ANALYTICS layer — BI tools, graph tools and visual analytics.]
38. InfiniteGraph – The Enterprise Graph Database
• A high performance distributed database engine that supports analyst-time decision support and actionable intelligence.
• Cost effective link analysis – flexible deployment on commodity resources (hardware and OS).
• Efficient, scalable, risk-averse technology – enterprise proven.
• High speed parallel ingest to load graph data quickly.
• Parallel, distributed queries.
• Flexible plugin architecture.
• Complementary technology.
• Fast proof of concept – easy to use Graph API.
39. Objectivity/DB
A distributed object database built for handling data with many complex relationships.
• Reliable – deployed in process control, telecom and medical equipment, Big Science, complex financial, defense and Intelligence Community applications.
• Provably scalable – used to build the world’s first petabyte+ database at the Stanford Linear Accelerator Center in the year 2000.
• Advanced query capabilities – Parallel Query Engine.
• Interoperable across languages and platforms:
– C++, C#, Java, Python and SQL++
– Linux, Mac OS X and Windows (32- and 64-bit)
40. The Big Data Connection Platform
[Diagram: a layered stack — data visualization & analytics tools, the Big Data Connection Platform, processing platforms, connectors/integration, and servers/file storage — with vendor logos annotated by subsequent acquisitions (“now HP”, “now IBM”, “now EMC”, “now Teradata”, “now SAP”, “now Oracle”).]
42. Thank You!
Please take a look at objectivity.com for online demos, white papers, free downloads, samples & tutorials.
You can also see us at NoSQL Now! in San Jose, CA on August 22.
Editor's Notes
Thinking we should be less about Objy in the last bullet… possibly Object oriented and graph databases… ?
Note Object Oriented Databases as NOSQL here.
By initiating a polyglot approach – One can utilize existing SQL based architecture and databases while still gaining the competitive advantage that the latest NOSQL technologies provide. One example of this Polyglot approach is shown here. The technology(ies) used would be dependent on the use case.
This section seems out of place.
By having a scalable and distributed platform that can manage connections between all types of disparate data, enterprises can easily capitalize on the best tools for the job at hand.