2. What is covered in this presentation?
A brief history of databases
NoSQL: why, what and when?
Characteristics of NoSQL databases
Aggregate data models
CAP theorem
3. Introduction
• Database: an organized collection of data
• DBMS: a software package of computer programs that controls the creation, maintenance and use of a database
• Databases are created to handle large quantities of information by inputting, storing, retrieving, and managing that information
5. Benefits of relational databases:
Designed for all purposes
ACID
Strong consistency, concurrency, recovery
Mathematical background
Standard query language (SQL)
Lots of tools to use with them, e.g. reporting services, entity frameworks, ...
Relational databases
7. But...
• Relational databases were not
built for distributed applications.
Because...
• Joins are expensive
• Hard to scale horizontally
• Impedance mismatch occurs
• Expensive (product cost, hardware, maintenance)
NoSQL why, what and when?
14. Data size growth
Examples:
• ISRO launches the advanced earth observation
and mapping satellite CARTOSAT-3 along with
13 other commercial nano-satellites
– Information and images coming from the satellite
• Maharashtra election: 20,000 tweets/second
• Around 30 billion RFID tags produced/year
– Automatic toll collection using RFID
• Oil drilling platforms have 20k to 40k sensors
95% of data produced is unstructured
15. Challenge
Big Data’s characteristics are challenging conventional information
management architectures
Massive and growing amounts of information residing internal
and external to the organization
Unconventional semi-structured or unstructured (diverse) data, including web pages, log files, social media, click-streams, instant messages, text messages, emails, sensor data from active and passive systems, etc.
Changing information
Analytics use cases: multi-channel, sentiment, transaction, call detail records, warranty claim, surveillance, claim fraud
16. What is big data?
“A massive volume of both structured and unstructured data
that is so large that it's difficult to store, analyse, process,
share, visualise and manage with traditional database and
software techniques.” - Roger Magoulas of O’Reilly, 2005
• Big data technologies describe a new generation of
technologies and architectures, designed to economically
extract value from very large volumes of a wide variety of
data, by enabling high velocity capture, discovery, and/or
analysis
• IBM / MS
– Volume (Terabytes -> Zettabytes)
– Variety (Structured -> Semi-structured -> Unstructured)
– Velocity (Batch -> Streaming Data)
17. What Makes it Big Data? (V3)
Volume, Velocity, Variety, Value
[Figure: example data sources: social media, blogs, smart meters, raw binary data]
• Volume: gigabyte (10^9 bytes), terabyte (10^12), petabyte (10^15), exabyte (10^18), zettabyte (10^21)
• Variety: structured, semi-structured, unstructured; text, image, audio, video, record
• Velocity: dynamic, sometimes time-varying
18. Variability:
Variability vs. variety: 6 different coffee blends are variety; the same blend tasting different every day is variability.
The same is true of data: if the meaning is constantly changing, it can have a huge impact on your data homogenization.
Visualization:
Using charts and graphs to visualize large amounts of complex data
19. A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases
NoSQL systems are also referred to as "Not only SQL" to emphasize that they may also support SQL-like query languages.
But What is NoSQL?
20. NoSQL avoids:
Overhead of ACID transactions
Complexity of SQL queries
Burden of up-front schema design
DBA presence
Transactions (these should be handled at the application layer)
Provides:
Easy and frequent changes to the DB
Fast development
Large data volumes (e.g. Google)
Schema-less design
Characteristics of NoSQL databases
22. In relational databases:
You can’t add a record which does not fit the schema
You need to add NULLs to unused items in a row
You must consider the data types, e.g. you can’t add a string to an integer field
You can’t add multiple items in a field (you should create another table: primary key, foreign key, joins, normalization, ...)
What is a schema-less data model?
23. In NoSQL databases:
There is no schema to consider
There is no unused cell
There is no declared data type (types are implicit)
Most considerations are handled in the application layer
We gather all items in an aggregate (document)
What is a schema-less data model?
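The contrast between the two slides above can be sketched in a few lines of Python. This is only an illustration, not how any particular NoSQL product works: the `collection` list, the `insert` and `phones_of` helpers, and the sample records are all hypothetical names made up for this example.

```python
# Hypothetical in-memory "collection": a list of documents (plain dicts).
# Nothing here is a real database API; it only illustrates the idea.
collection = []

def insert(doc):
    """Store a document as-is; the store checks no schema."""
    collection.append(doc)

# Two records with different fields: no NULL padding, no extra table.
insert({"name": "Asha", "phones": ["555-0100", "555-0101"]})
insert({"name": "Ravi", "email": "ravi@example.com", "age": 31})

def phones_of(doc):
    """The application layer decides what a missing field means."""
    return doc.get("phones", [])

# The set of fields is whatever the stored documents happen to contain.
all_fields = sorted({key for doc in collection for key in doc})
```

Note that the multi-valued `phones` field lives inside the document itself, where a relational design would normally need a second table and a join.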
24. NoSQL databases are classified in four major data models:
• Key-value
• Document
• Column family
• Graph
Each DB has its own query language
Categories of NoSQL databases
25. Simplest NoSQL databases
The main idea is the use of a hash table
Access data (values) by strings called keys
Data has no required format; it may have any format
Data model: (key, value) pairs
Basic operations: Insert(key, value), Fetch(key), Update(key, value), Delete(key)
Key-value data model
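The four basic operations on this slide map almost directly onto a hash table. The sketch below uses a Python `dict` (which is a hash table) as the backing structure; the class name `KeyValueStore` and its error-handling choices are assumptions for illustration, not the interface of any real key-value database.

```python
class KeyValueStore:
    """Toy key-value store backed by a dict (a hash table)."""

    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        # Assumption: inserting an existing key is treated as an error.
        if key in self._data:
            raise KeyError(f"key already exists: {key!r}")
        self._data[key] = value

    def fetch(self, key):
        # Returns None when the key is absent.
        return self._data.get(key)

    def update(self, key, value):
        # Assumption: updating a missing key is treated as an error.
        if key not in self._data:
            raise KeyError(f"no such key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        # Deleting a missing key is silently ignored.
        self._data.pop(key, None)


store = KeyValueStore()
store.insert("user:1", {"name": "Asha"})   # value can have any format
store.insert("counter", 42)
store.update("counter", 43)
store.delete("user:1")
```

The store never inspects the values, which is why the slide can say the data has no required format.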
26. Row-oriented DB: stores data row by row, suitable for OLTP
Column-oriented DB: stores data column by column, suitable for OLAP
Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally (large data and random read/write)
The column is the smallest unit of data: a tuple that contains a name, a value and a timestamp
Column family data model
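The "name, value, timestamp" tuple can be modelled directly. The following is a minimal in-memory sketch, not HBase or Cassandra code: the `Column` type, the nested `table` dictionary, and the `put`/`get` helpers are hypothetical, and the rule that the newest timestamp wins is an assumption added here to show what the timestamp can be used for.

```python
from typing import Any, NamedTuple

class Column(NamedTuple):
    """Smallest unit of data: a (name, value, timestamp) tuple."""
    name: str
    value: Any
    timestamp: int

# Hypothetical layout: row key -> column family -> column name -> Column
table = {}

def put(row_key, family, name, value, timestamp):
    columns = table.setdefault(row_key, {}).setdefault(family, {})
    current = columns.get(name)
    # Assumption for this sketch: the write with the newest timestamp wins.
    if current is None or timestamp >= current.timestamp:
        columns[name] = Column(name, value, timestamp)

def get(row_key, family, name):
    return table[row_key][family][name].value

put("user1", "profile", "city", "Pune", timestamp=1)
put("user1", "profile", "city", "Mumbai", timestamp=2)
put("user1", "profile", "city", "Delhi", timestamp=1)  # stale write, ignored
```

Each row may carry a different set of columns within a family, so rows stay sparse without any NULL padding.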
29. Some statistics about Facebook Search (using Cassandra)
MySQL, > 50 GB data
Writes average: ~300 ms
Reads average: ~350 ms
Rewritten with Cassandra, > 50 GB data
Writes average: 0.12 ms
Reads average: 15 ms
Column family data model
30. Based on Graph Theory.
Scale vertically, no clustering.
You can use graph algorithms
easily
Transactions
ACID
Graph data model
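As an example of a graph algorithm that becomes easy once data is stored as nodes and edges, here is a breadth-first search for the shortest path between two nodes. It is a plain-Python sketch over an adjacency dictionary, not the query language of any graph database; the `graph` data and the `shortest_path` function are made up for illustration.

```python
from collections import deque

# Hypothetical "follows" graph stored as an adjacency list.
graph = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns a shortest path or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

path = shortest_path(graph, "alice", "erin")
```

In a relational schema the same question would need one self-join per hop, which is why relationship-heavy data is often a good fit for the graph model.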
31. • Pair each key with a complex data structure known as a document.
• Indexes are done via B-Trees.
• Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
Document-based data model
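A document of the kind described above might look like the following. This is a sketch using plain Python dictionaries and the standard `json` module; the order data, the `_id` field name, and the `documents` dictionary are hypothetical and do not represent the API of any specific document database.

```python
import json

# One document: key-value pairs, a key-array pair, and nested documents.
order = {
    "_id": "order-1001",
    "customer": {"name": "Asha", "city": "Pune"},      # nested document
    "items": [                                          # array of documents
        {"sku": "A1", "qty": 2, "price": 150.0},
        {"sku": "B7", "qty": 1, "price": 80.0},
    ],
    "tags": ["gift", "express"],                        # key-array pair
}

# Hypothetical store: each key is paired with a whole document.
documents = {order["_id"]: order}

# The whole aggregate is read with one lookup, with no joins needed.
stored = documents["order-1001"]
total = sum(item["qty"] * item["price"] for item in stored["items"])

# Documents are commonly exchanged as JSON text.
serialized = json.dumps(order)
```

Because the order, its customer, and its line items travel together as one aggregate, the application can load or save them in a single operation.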
33. • NoSQL may complement RDBMS
– RDBMS may hold smaller amounts of high-value structured data
– NoSQL may hold vast amounts of less valued and less structured data
• Relational implementations provide ACID guarantees
– Atomicity: a transaction is treated as an all-or-nothing operation
– Consistency: database values are correct before and after the transaction
– Isolation: each transaction runs as if it were the only transaction
– Durability: once a transaction completes, its result is not reversed
• NoSQL often provides BASE
– Basically available: parts of the system are allowed to fail (sharding/partitioning)
– Soft state: an object may hold different values at the same time (e.g. on different replicas)
– Eventually consistent: consistency is achieved over time (not on every commit)
• CAP theorem
– It is impossible to guarantee consistency, availability, and partition tolerance all at once in a distributed system
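"Soft state" and "eventually consistent" can be illustrated with two replicas that accept writes independently and synchronize later. This is a toy sketch under stated assumptions: the `Replica` class and `sync` function are hypothetical, and conflicts are resolved by keeping the value with the highest version number, which is only one of several strategies real systems use.

```python
class Replica:
    """Toy replica: stores key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Assumption: the highest version number wins.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.data.items()):
            dst.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("cart", "1 item", version=1)
before_sync = r2.read("cart")   # r2 has not seen the write yet (soft state)
sync(r1, r2)
after_sync = r2.read("cart")    # replicas now agree (eventual consistency)
```

Both replicas stay available for reads and writes the whole time; the price is that a read between the write and the sync can return stale data.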
34. What do we need?
• We need a distributed database system with these features:
– Fault tolerance
– High availability
– Consistency
– Scalability
Which is impossible, according to the CAP theorem!
35. We cannot achieve all three properties in distributed database systems (the center of the diagram)
The CAP theorem