Big Data, NoSQL, NewSQL & The Future of Data Management

Big Data, NoSQL, NewSQL &
The Future of Data Management

Tony Bain
RockSolid SQL
www.rocksolidsql.com

RockSolid SQL

About this Webinar

 Big Data
• What is it?
• Why does it matter?
• How it impacts the Enterprise?

 Data Science
Tech Level 100
 NoSQL
• Key/Value Stores
• Document Stores
• Graph Databases

 NewSQL
• In Memory Transaction Processing
• MPPP

 Hadoop

 Summary

Big Data Webinar

What is Big Data?

“Big data[1] are datasets that grow so large that they become awkward to
work with using on-hand database management tools. Difficulties include
capture, storage,[2] search, sharing, analytics,[3] and visualizing. This trend
continues because of the benefits of working with larger and larger
datasets allowing analysts to "spot business trends, prevent diseases,
combat crime."[4] Though a moving target, current limits are on the order
of terabytes, exabytes and zettabytes of data.[5]”

“Variety– Big data extends beyond structured data, including
unstructured data of all varieties
Velocity – Rate of acquisition and desired rate of consumption.
Volume –Terabytes and even petabytes of information.”

My Definition

 Big Data is the “modern scale” at which we are defining or data usage
challenges. Big Data begins at the point where need to seriously start
thinking about the technologies used to drive our information needs.

While Big Data as a term seems to refer to volume this isn’t the case. Many
existing technologies have little problem physically handling large volumes
(TB or PB) of data. Instead the Big Data challenges result out of the
combination of volume and our usage demands from that data. And those
usage demands are nearly always tied to timeliness.

Big Data is therefore the push to utilize “modern” volumes of data within
“modern” timeframes. The exact definitions are of course are relative &
constantly changing, however right now this is somewhere along the path
towards the end goal. This is of course the ability to handle an unlimited
volume of data, processing all requests in real time.

Jan 2010

http://blog.tonybain.com/tony_bain/2010/01/what-is-big-data.html

For the purpose of this talk

Volume and Variety of Data
that is difficult to mange using
Big Data = traditional data management
technology

Where does big data come from?

 Big Data is not an outcome of what we consider traditional user
“transactions”.

“VisaNet authorizes, clears and settles an average of 130 million
transactions per day in 200 countries and territories.”
(http://corporate.visa.com/about-visa/technology/transaction-processing.shtml)

 Instead Big Data is mostly sourced from “Meta Data” about our
existence
• Sensors used to gather telemetry information
• Social media interactions & relationships
• Pictures and videos posted online
• Cell phone GPS signals
• Web page viewing habits
• Network, Computer generated, equipment logs
• Device outputs
• Online gaming

Where does big data come from?

 Instead Big Data is generally meta data

“Since we only track the data that the games explicitly want to track, and
it’s all structured data …We just write the 5TB of structured information that
we need, and this is a scale that high-end SQL databases can definitely
handle.”

“eBay processes 50 petabytes of information each day while adding 40
terabytes of data each day as millions buy and sell items across 50,000
categories. Over 5,000 business users and analysts turn over a terabyte of
data every eight seconds. Raising parallel efficiency, the effectiveness of
distributing large amounts of workload over pools and grids of servers
equals millions in operating expense savings.”

What about the Enterprise?

 The median data warehouse has remained consistent for a number of years (6.6GB)
• Is the pending Big Data explosion hype or a reality?

 The enterprise is focused on ROI
• We do store large amounts of historical data as this serves a perceived need & the overhead is low
• We have not yet started capturing large amounts of meta data as demonstrating return on this
investment to date would have been difficult
• More data will be captured as we gain the ability to derive value from that data
• More data will be captured as we understand the value of auxiliary data

 The skills gap
• Big data expertise are few and far between
• Technologies are new and raw
• The combination of skills is new

However this is expected to change in the next 2-5 years

Gartner Hype Cycle, 2011

http://www.gartner.com/technology/research/methodologies/hype-cycle.jsp

Why Big Data?

 Competitive Advantage is becoming highly time sensitive
• Consumers are fickle, change is easy, word of mouth is
rapid and powerful
• The rate at which we can understand and respond to
changes in customer sentiment, product needs,
satisfaction, engagement, risk profiles = competitive
strength
• Reinvention is constant

 So how do we do this?
• With Data Science!

What is Data Science?

“I keep saying that the sexy job in the next 10 years will be
statisticians,”
Hal Varian, chief economist at Google

"data is the next Intel Inside.“
Tim O'Reilly, O'Reilly

“We’re rapidly entering a world where everything can be
monitored and measured, but the big problem is going to be the
ability of humans to use, analyze and make sense of the data.”
Erik Brynjolfsson, MIT

“Data science is the civil engineering of data. Its acolytes
possess a practical knowledge of tools & materials, coupled with
a theoretical understanding of what's possible.”
Michael Driscoll , Metamarkets

What is Data Science?

 Historical Business Analytics
• Effectively reporting what has happened or what is happening
• Using pre-determined models for prediction
• Known data models
• Known questions
• Known outcomes

 The Future of Data Science
• Discovering new information from historical data
• Often involves combining public & private data sets
• Using machine learning to create unique prediction models
• Diverse and complex data models
• Questions may not be known
• Outcomes may not be expected
• Visualizing results

Public Data Sets

 There is a phenomenal amount of relevant publically available data sets
• Geographical, Climatic, Demographic, Knowledge bases
• Historical, political, medical, sports, social, music

 Infochimps (15,000 data sets)
• NYSE Daily 1970-2010 Open, Close, High, Low and Volume
• Global Daily Weather Data from the National Climate Data Center (NCDC)
• Average Hours Worked Per Day by Employed Persons: 2005
• Family Net Worth -- Mean and Median Net Worth in Constant

 Factual

 Data.gov

Telecommunications Churn

 Analyzed billions of calls from millions of customers
 More than 2,000 call history variables such as time to end of contract, subscriber tenure,
number of customer care calls, etc.
 Adding Social networking variables improved predicted churn much better.
 The "social networking" variables were identified by observing who each customer spoke
with on the phone the most frequently.

 If someone in your caller network churned, you were then ~700% more likely to churn as
well.

Visualizations

http://www.tableausoftware.com/public/gallery/taleof100

Visualizations

http://flowingdata.com/2011/10/12/where-people-dont-use-facebook/#more-19315

Big Data & the RDBMS

 Volume/Velocity is too great for a single system to manage

 Big Data often requires distribution over multiple hosts

 The general purpose relational database is difficult to scale out

 Has been a computer science challenge for many years
• Giving up nothing and scaling out has proven very difficult

Relaxing Consistency

Consistent, Eventually (Eventual Consistency)

“Given enough time without changes everything will eventually be consistent”

 Many big data technologies relax aspects of CAP to achieve scale, performance &
availability
 Not necessarily suitable for all requirements
 Not necessarily the only way to achieve scale

 Much debate CAP, BASE
 Tunable consistency

Big Data Webinar

What is NoSQL?

 NoSQL term was coined in 1998 however remained largely unused until 2009

 Resurgence was the result of key papers from Amazon & Google and the resulting projects at the likes of
Rackspace and Facebook.
• Google Bigtable (http://labs.google.com/papers/bigtable.html)
• Amazon Dynamo (www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf)

 Started as a design philosophy of simplifying data storage technologies to the point where massive scale
could be achieved

 Now more broad, includes a range of data management technologies which are not based on the
relational model.

 NoSQL data stores tends to fall into broad categories
• Distributed Key/Value stores
• Document stores
• Graph Databases

 A little ironic as many “NoSQL” data stores have evolved support for SQL like queries.

Big Data Webinar

NoSQL

www.google.com/trends

Big Data Webinar

RDBMS

www.google.com/trends

NoSQL

Distributed Key/Value Store - Column Family

 Database system designed to scale  Cassandra
• Distributed scale over many nodes
 Oracle Big Data Appliance
 Usually focused on transactional • BerkeleyDB
performance
• Usually unsuitable for OLAP  Voldemort

 Limited querying capabilities  Hypertable
• Primarily single record lookup by “key”
 Redis
 Usually flexible schema
• Row 1 may have different schema to  Raik
Row 2

 Usually
• No Joins / normalization
• No ordering
• No aggregation

NoSQL

Document Model

 Records are considered documents  MongoDB

 Records have a flexible or varying schema  CouchDb
depending on attributes of that document
 Marklogic
 Schemas are usually based on an encoding
such as JSON, BJSON or XML

 Document orientated data stores generally
have more functionality than k/v stores so
sacrifice some performance

 Document orientated data stores tend to fit
more naturally with OO paradigms than
RDBMS reducing plumbing code
significantly

 Scale out is generally possible using a
sharding and replication

NoSQL

Graph

 Graph databases are used to store entities  DEX
and relationships between entities
 FlockDB
 Nodes, properties and edges
 InfitieGraph
 Examples of Graph database strengths
• Shortest distance between entities  Neo4J
• Social network analysis
• If Bob knows Bill and Mary knows Bill
maybe Bob knows Mary

NewSQL

What is NewSQL

 Term coined recently by Michael Stonebraker

 Not throwing out the baby with the bathwater
• SQL and the Relational Model is well known by almost
every developer
• Arguments that the relational model itself is not a
limitation on scale, it is the physical implementation of
the model which has limitations

 Retaining key aspects of the RDBMS but removing some of
the general purpose nature

 By specializing challenges associated with scaling can be
more easy overcome

NewSQL

In Memory TP

 In Memory is becoming a reality for Examples
transaction processing
 VoltDB
 Often designed only to process short  SAP Hana
duration transactions

 Systems are appearing with 1TB RAM per
node

 Distributed K-Safety protects from single
node failure

 Traditional command logging/snapshoting
can provide protection from complete
system failure
• System bug
• Comes with performance costs

NewSQL

MPP

 Focus on distributing of data and load Examples
across multiple hosts for the purpose of
analytical query performance
 Microsoft Parallel Data Warehouse
 Often use brute force access methods (i.e.
 HP Vertica
no user indexing)
 EMC Greenplum
 Often focus on running low concurrency
high impact queries  Teradata / Aster Data
 Often columnar storage or hybrid  IBM Netezza
• Improved compression
• Improved query performance across
wide tables

Hadoop

What is Hadoop?

 Apache Hadoop is an open source project inspired by Google’s Map/Reduce
and GFS papers
• http://labs.google.com/papers/mapreduce.html
• http://labs.google.com/papers/gfs.html

 Hadoop consists of
• A distributed files system (HDFS)
• The Map/Reduce engine

 Yahoo has been the largest contributor to the project, although now many
original contributors are working for Cloudera or Hortonworks

 Hadoop is not a database
• Data lives in the file system in raw format
• Jobs process data direct from the files system using Map/Reduce
• Data can be directly accessed without ETL

The Scale of Hadoop

 Hadoop Scales to very large clusters, capable of supporting very large
workloads
• Yahoo has ~20 Hadoop clusters ~200PB (largest 4000 nodes)
• Facebook have 2000 nodes in a single cluster with 20PB

 Cloudera has 22 customers in prod with over 1PB

BUT we are not all Facebook or Yahoo

 Most clusters are less than 30 nodes
• The average cluster size is 200 nodes being boosted by the increasing
number of large clusters

 In the Enterprise Hadoop is being used increasingly for
• Data Processing
• Analytics

Hadoop

HDFS & Map/Reduce

 HDFS  Map/Reduce
 Distributed file system  Programming Paradigm
 Responsible for ensuring each  Takes seemingly complex tasks
block is replicated to multiple and breaks them into a series of
nodes • Mapping Tasks
• Protection from node failure • Reduction Tasks
• The large the cluster size the
more likely node failure is
 Elastic Map/Reduce

How is Hadoop used in the Enterprise?

FSI Retail & Manufacturing
 Customer Risk Analysis  Customer Churn
 Surveillance and Fraud Detection  Brand and Sentiment Analysis
 Central Data Repository  Point of Sales
 Personalization and Asset Management  Pricing Models
 Market Risk Modeling  Customer Loyalty
 Trade Performance Analytics  Targeted Offers

Science & Energy
 Genomics
 Utilities and Power Grid
 Smart Meters
 Biodiversity Indexing
 Network Failures
 Seismic Data

Where does this leave the RDBMS?

 Absolutely the RDBMS will remain the single most important data management technology for the
foreseeable future
• Its “General Purpose” nature still makes it the suitable technology of choice for 95% of data
management needs
• What we have discussed in this presentation is the 5% of requirements which have volume/velocity
issues not suitably managed by the RDBMS

 These technologies are on a convergence path
• NoSQL, NewSQL, Hapdoop, MPP – were all created out a need that wasn’t being met
• But we don’t “want” many different technologies, we like one “General Purpose” tool
• Bringing the technologies together through acquisition, innovation and integration is the role of the
major vendors

 This is already happening

The major Vendors are following

 Oracle
• Exadata
• Big Data Appliance (http://www.oracle.com/us/corporate/features/feature-obda-498724.html)
• Oracle Tools & Loader for Hadoop (http://www.oracle.com/us/corporate/features/feature-oracle-
loader-for-hadoop-505115.html)
• Exalytics Appliance (http://gigaom.com/2011/10/02/oracle-exalytics-attacks-big-data-analytics/)

 Microsoft
• Parallel Data Warehouse (http://www.microsoft.com/sqlserver/en/us/solutions-technologies/data-
warehousing/pdw.aspx)
• SQL Server connector for Hadoop (http://www.microsoft.com/download/en/details.aspx?id=27584)
• Microsoft Hadoop (2012) (http://www.zdnet.com/blog/microsoft/microsoft-to-develop-hadoop-
distributions-for-windows-server-and-azure/10958)
• Microsoft Daytona (http://research.microsoft.com/en-us/projects/daytona/)

How does an enterprise prepare?

 Understand success metrics & where competitive advantage is derived
• Customer retention, upsell, revenue growth

 Is business driven not technology driven

 IT can however ensure they are educated on new ways to fulfill business requirements

 Even if you remain aligned with major venders diversity is imminent

Big Data, NoSQL, NewSQL & The Future of Data Management

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (17)

Similar to Big Data, NoSQL, NewSQL & The Future of Data Management

Similar to Big Data, NoSQL, NewSQL & The Future of Data Management (20)

Recently uploaded

Recently uploaded (20)

Big Data, NoSQL, NewSQL & The Future of Data Management

Editor's Notes