SlideShare a Scribd company logo
1 of 82
Download to read offline
POLYGLOT PERSISTENCE AND MULTI-CLOUD DATA
MANAGEMENT SOLUTIONS
GENOVEVAVARGAS SOLAR
FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE
Genoveva.Vargas@imag.fr	
  	
  
http://www.vargas-­‐solar.com	
  
DATA ALL AROUND
2
30 billion
pieces of content
are shared on Facebook
every month
40 billion+
hours video
are watched onYouTube
each month
As of 2011 the global size
Data in healthcare was
estimated to be
150 Exabytes
(161 billion of Gigabytes)
40 millionTweets
are sent per day about 200
monthly active users
By 2014 it is anticipated
there will be
400 million
wearable wireless
health monitors
Web 2.0 sites where millions of users may both read and
write data, scalability for simple database operations has
become important
Data collections available through front ends managed
through public/private organization, available through the
Web, e.g., Sloan Sky Server
STORING AND ACCESSING HUGE AMOUNTS OF DATA
Peta	
  1015	
  
Exa	
  1018	
  
Zetta	
  1021	
  
Yota	
  1024	
  
RAID	
  
Disk	
  
Cloud	
  
3
•  Data formats
•  Data collection sizes
•  Data storage supports
•  Data delivery mechanisms
4
THIS TALK IS NOT ABOUT
5
http://nosql-­‐database.org	
  
Debate on whether NoSQL stores and relational systems are better or worse …
that is not the point
Of course we can surf on these waves
at the end of the talk and during EDBT School!
THIS TALK IS ABOUT
6
alternative for managing multiform and multimedia data collections
according to different properties and requirements
AGENDA
¡  Dealing with multiform and multimedia data collections:
it’s all about management requirements
¡  Polyglot persistence general principle
¡  The NoSQL plethora
¡  Polyglot database
¡  Design
¡  Deployment
¡  Querying, update and maintenance
¡  Conclusions and outlook
7
8
POLYGLOT PERSISTENCE
¡  Polyglot Programming: applications should be written in a mix of languages to take
advantage of different languages are suitable for tackling different problems
¡  Polyglot persistence: any decent sized enterprise will have a variety of different data
storage technologies for different kinds of data
¡  a new strategic enterprise application should no longer be built assuming a relational
persistence support
¡  the relational option might be the right one - but you should seriously look at other
alternatives
9
M.	
  Fowler	
  and	
  P.	
  Sadalage.	
  NoSQL	
  Distilled:	
  A	
  Brief	
  Guide	
  to	
  the	
  Emerging	
  World	
  of	
  Polyglot	
  Persistence.	
  Pearson	
  Education,	
  Limited,	
  2012	
  
(Katsov-2012)
Use the right tool for the right job…
How do I know which is the
right tool for the right job?
10
SCALING DATABASE SYSTEMS
¡  A system is scalable if increasing its resources
(CPU, memory, disk) results in increased
performance proportionally to the added
resources
¡  Improving performance means serving more units
of work for handling larger units of work like when
data sets grow
¡  Database systems have been scaled by buying
bigger faster and more expensive machines
11
¡  Vertically (SCALE UP)
¡  Add resources (CPU, memory) to a single node in a system
¡  Horizontally (SCALE OUT)
¡  Add more nodes to a system
NOSQL STORES CHARACTERISTICS
¡  Simple operations
¡  Key lookups reads and writes of one record or a small number of records
¡  No complex queries or joins
¡  Ability to dynamically add new attributes to data records
¡  Horizontal scalability
¡  Distribute data and operations over many servers
¡  Replicate and distribute data over many servers
¡  No shared memory or disk
¡  High performance
¡  Efficient use of distributed indexes and RAM for data storage
¡  Weak consistency model
¡  Limited transactions
12
Next generation databases mostly addressing some of the points: being non-relational, distributed,
open-source and horizontally scalable [http://nosql-­‐database.org]	
  
DEALING WITH HUGE AMOUNTS OF DATA
13
Peta	
  1015	
  
Exa	
  1018	
  
Zetta	
  1021	
  
Yota	
  1024	
  
RAID	
  
Disk	
  
Cloud	
  
Concurrency	
  
Consistency	
  
Atomicity	
  
Relational	
  
Graph	
  
Key	
  value	
  
Columns	
  
DATA MANAGEMENT SYSTEMS ARCHITECTURES
Physical	
  model	
  
Logic	
  model	
  
External	
  model	
  
ANSI/SPARC	
  
Storage	
  
Manager	
  
Schema	
  
Manager	
  
Query	
  
Engine	
  
Transaction	
  
Manager	
  
DBMS	
  
Customisable	
  points	
  
Custom	
  components	
  
Glue	
  
code	
  
Data	
  	
  
services	
  
Access	
  
services	
  
Storage	
  
services	
  
Additional	
  
Extension	
  
services	
  
Other	
  
services	
  
Extension	
  services	
  
Streaming, XML, procedures,
queries, replication
Physical	
  model	
  
Logic	
  model	
  
External	
  model	
  
Physical	
  model	
  
Logic	
  model	
  
External	
  model	
  
14
15
Data	
  stores	
  designed	
  	
  to	
  scale	
  simple	
  	
  
OLTP-­‐style	
  applica7on	
  loads	
  	
  
•  Data	
  model	
  	
  
•  Consistency	
  	
  
•  Storage	
  	
  
•  Durability	
  	
  
•  Availability	
  	
  
•  Query	
  support	
  
Read/Write	
  operations	
  	
  
by	
  thousands/millions	
  of	
  users	
  
PROBLEM STATEMENT: HOW MUCH TO GIVE UP?
¡  CAP theorem1: a system can have two of the three properties
¡  NoSQL systems sacrifice consistency
16
Consistency	
  Availability	
  
Fault-­‐tolerant	
  	
  
partitioning	
  
1	
  Eric	
  Brewer,	
  "Towards	
  robust	
  distributed	
  systems."	
  PODC.	
  2000	
  http://www.cs.berkeley.edu/~brewer/cs262b-­‐2004/PODC-­‐keynote.pdf	
  	
  
NOSQL STORES:AVAILABILITY AND PERFORMANCE
¡  Replication
¡  Copy data across multiple servers (each bit of data can be found in
multiple servers)
¡  Increase data availability
¡  Faster query evaluation
¡  Sharding
¡  Distribute different data across multiple servers
¡  Each server acts as the single source of a data subset
¡  Orthogonal techniques
17
REPLICATION: PROS & CONS
¡  Data is more available
¡  Failure of a site containing E does not result in
unavailability of E if replicas exist
¡  Performance
¡  Parallelism: queries processed in parallel on several
nodes
¡  Reduce data transfer for local data
¡  Increased updates cost
¡  Synchronisation: each replica must be updated
¡  Increased complexity of concurrency control
¡  Concurrent updates to distinct replicas may lead to
inconsistent data unless special concurrency control
mechanisms are implemented
18
SHARDING:WHY IS IT USEFUL?
¡  Scaling applications by reducing data sets in any single
databases
¡  Segregating data
¡  Sharing application data
¡  Securing sensitive data by isolating it
¡  Improve read and write performance
¡  Smaller amount of data in each user group implies faster querying
¡  Isolating data into smaller shards accessed data is more likely to
stay on cache
¡  More write bandwidth: writing can be done in parallel
¡  Smaller data sets are easier to backup, restore and manage
¡  Massively work done
¡  Parallel work: scale out across more nodes
¡  Parallel backend: handling higher user loads
¡  Share nothing: very few bottlenecks
¡  Decrease resilience improve availability
¡  If a box goes down others still operate
¡  But: Part of the data missing
19
Load%balancer%
Cache%1%
Cache%2%
Cache%3%
MySQL%
Master%
MySQL%
Master%
Web%1%
Web%2%
Web%3%
Site%database%
Resume%database%
SHARDING AND REPLICATION
¡  Sharding with no replication: unique copy, distributed data sets
¡  (+) Better concurrency levels (shards are accessed independently)
¡  (-) Cost of checking constraints, rebuilding aggregates
¡  Ensure that queries and updates are distributed across shards
¡  Replication of shards
¡  (+) Query performance (availability)
¡  (-) Cost of updating, of checking constraints, complexity of concurrency control
¡  Partial replication (most of the times)
¡  Only some shards are duplicated
20
NOSQL STORES: DATA MANAGEMENT PROPERTIES
¡  Indexing
¡  Distributed hashing like Memcached open source cache
¡  In-memory indexes are scalable when distributing and
replicating objects over multiple nodes
¡  Partitioned tables
¡  High availability and scalability: eventual consistency
¡  Data fetched are not guaranteed to be up-to-date
¡  Updates are guaranteed to be propagated to all nodes
eventually
¡  Shared nothing horizontal scaling
¡  Replicating and partitioning data over many servers
¡  Support large number of simple read/write operations
per second (OLTP)
¡  No ACID guarantees
¡  Updates eventually propagated but limited guarantees
on reads consistency
¡  BASE: basically available; soft state, eventually consistent
¡  Multi-version concurrency control
21
COMPARING NOSQL & NEWSQL SYSTEMS
SYSTEM CONCURRENCY
CONTROL
DATA
STORAGE
REPLICATION TRANSACTION
Redis Locks RAM Asynchronous No
Scalaris Locks RAM Synchronous Local
Tokyo Locks RAM/Disk Asynchronous Local
Voldemort MVCC RAM/BDB Asynchronous No
Riak MVCC Plug in Asynchronous No
Membrain Locks Flash+Disk Synchronous Local
Membase Locks Disk Synchronous Local
Dynamo MVCC Plug in Asynchronous No
SimpleDB Non S3 Asynchronous No
MongoDB Locks Disk Asynchronous No
CouchDB MVCC Disk Asynchronous No
22
SYSTEM CONCURRENCY
CONTROL
DATA
STORAGE
REPLICATION TRANSACTION
Terrastore Locks RAM+ Synchronous L
Hbase Locks HADOOP Asynchronous L
HyperTable Locks Files Synchronous L
Cassandra MVCC Disk Asynchronous L
BigTable Locs+stamps GFS Both L
PNuts MVCC Disk Asynchronous L
MySQL-C ACID Disk Synchronous Y
VoltDB ACID/no Lock RAM Synchronous Y
Clustrix ACID/no Lock Disk Synchronous Y
ScaleDB ACID Disk Synchronous Y
ScaleBase ACID Disk Asynchronous Y
NimbusDB ACID/no Lock Disk Synchronous Y
Key-­‐Value	
  Document	
  
Extended	
  records	
  Relational	
  
Cattell,	
  Rick.	
  "Scalable	
  SQL	
  and	
  NoSQL	
  data	
  stores."	
  ACM	
  SIGMOD	
  Record	
  39.4	
  (2011):	
  12-­‐27	
  
AGENDA
ü  Dealing with multiform and multimedia data collections:
ü  it’s all about management requirements
ü  Polyglot persistence general principle
ü  The NoSQL plethora
¡  Polyglot database
¡  Design
¡  Deployment
¡  Querying, update and maintenance
¡  Conclusions and outlook
23
DESIGNING AND BUILDING A POLYGLOT DATABASE
24
25
OBJECTIVE
¡  Build a MyNet app based on a polyglot database for building an integrated directory
of my contacts including their status and posts from several social networks
MyNet	
  DB	
  MyNet	
  App	
   Social	
  network	
  
26
Analysis on contacts networks, overlapping
according to interests, posts topics
Top 10 most popular contacts
User sessions in different
Social networks
Integrating posts
from all networks
Contact graph traversal
For building groups out of
Common characteristics
Synchronizing posts to all
SNFriends network
User accounts activity
In different social networks
Directory synchronisation
Integrating contacts’ information
From all SN
NOSQL DESIGN AND CONSTRUCTION PROCESS
¡  Data reside in RAM (memcached) and is eventually replicated and stored
¡  Querying = designing a database according to the type of queries / map reduce model
¡  “On demand” data management: the database is virtually organized per view (external schema) on cache and some
view are made persistent
¡  An elastic easy to evolve and explicitly configurable architecture 27
Database
querying
Database
population
Database
organization
INDEX	
   Memcached
Replicated
Stored
GENERATING NOSQL PROGRAMS FROM HIGH LEVEL
ABSTRACTIONS
28
UML class diagram application classes
Spring Roo
Java	
  
web	
  
App	
   Spring Data
Graph database
Relational database
High-level abstractions
Low-level abstractions
http://code.google.com/p/model2roo/	
  	
  
DEPLOYING A POLYGLOT DATABASE
29
DEPLOYING A DATABASE APPLICATION ON THE CLOUD
Have a first experience on
¡  creating a database on a relational DBMS deployed on the cloud,
¡  building a simple database application exported as a service,
¡  deploying the service on the cloud and implement an interface.
30
MyNet	
  DB	
  
MyNet	
  App	
  
Contact'
idContact:"Integer"
firstName:"String"
lastName:"String"
Service	
  
App	
  
SHARDING AND DEPLOYING SHARDS ON THE CLOUD
Have a first experience on
¡  sharding a relational DBMS
¡  dealing with shards synchronization
¡  deploying the service on the cloud and implement an interface
31
MyNet	
  DB	
  
MyNet	
  
Service	
  
webSite:(URI(
socialNetworkID:(URI((
(
Basic(Info(
Contact'
idContact:(Integer(
lastName:(String(
givenName:(String(
society:(String( <<'hasBasicInfo'>>'
1' *'
postID:(Integer(
timeStamp:(Date(
geoStamp:(String(
Post'
contentID:(Integer(
text:(String(
Content'
<<'hasContent'>>'
<<publishesPost>>'
*'
1'
1' 1'
street:(String,((
number:(Integer,(
City:(Sting,((
Zipcode:(Integer(
Address'
*'
email:(String(
type:({pers,(prof}(
Email'
<<'hasAddress'>>'
*'
1'
<<'hasEmail'>>'
ID:(Integer,(
Lang:(String(
Language'
*'
<<'speaksLanguage'>>'
Français	
  
Español	
  
English	
  
Others	
  
REST	
  
JSON	
  documents	
  
MyNet	
  	
  
MULTI-CLOUD POLYGLOT DATABASE
MyNetContacts	
  
webSite:(URI(
socialNetworkID:(URI((
(
Basic(Info(
Contact'
idContact:(Integer(
lastName:(String(
givenName:(String(
society:(String(
<<'hasBasicInfo'>>'
<<'isContactof'>>'
groupName:(String(
Group'
<<'isComposedof'>>'
1' *'
1'
*'1'
*'
postID:(Integer(
timeStamp:(Date(
geoStamp:(String(
Post'
contentID:(Integer(
text:(String(
image:(Jpeg(
video:(Avi(
Content'
<<'hasContent'>>'
<<'publishesPost'>>'
*'
1'
1' 1'
street:(String,((
number:(Integer,(
City:(Sting,((
Zipcode:(Integer(
Address'
1'
*'
number:(String(
type:({pers,(prof}(
Phone'
email:(String(
type:({pers,(prof}(
Email'
<<'hasAddress'>>'
<<'hasOtherPhones'>>'
1'
*'
1'
*'
*'
<<'hasMobilePhones'>>'
1'
<<'hasEmail'>>'
webSite:(URI(
socialNetworkID:(URI((
(
Basic(Info(
Contact'
idContact:(Integer(
lastName:(String(
givenName:(String(
society:(String(
<<'hasBasicInfo'>>'
<<'isContactof'>>'
groupName:(String(
Group'
<<'isComposedof'>>'
1' *'
1'
*'1'
*'
postID:(Integer(
timeStamp:(Date(
geoStamp:(String(
Post'
contentID:(Integer(
text:(String(
image:(Jpeg(
video:(Avi(
Content'
<<'hasContent'>>'
<<'publishesPost'>>'
*'
1'
1' 1'
street:(String,((
number:(Integer,(
City:(Sting,((
Zipcode:(Integer(
Address'
1'
*'
number:(String(
type:({pers,(prof}(
Phone'
email:(String(
type:({pers,(prof}(
Email'
<<'hasAddress'>>'
<<'hasOtherPhones'>>'
1'
*'
1'
*'
*'
<<'hasMobilePhones'>>'
1'
<<'hasEmail'>>'
webSite:(URI(
socialNetworkID:(URI((
(
Basic(Info(
Contact'
idContact:(Integer(
lastName:(String(
givenName:(String(
society:(String(
<<'hasBasicInfo'>>'
<<'isContactof'>>'
groupName:(String(
Group'
<<'isComposedof'>>'
1' *'
1'
*'1'
*'
postID:(Integer(
timeStamp:(Date(
geoStamp:(String(
Post'
contentID:(Integer(
text:(String(
image:(Jpeg(
video:(Avi(
Content'
<<'hasContent'>>'
<<'publishesPost'>>'
*'
1'
1' 1'
street:(String,((
number:(Integer,(
City:(Sting,((
Zipcode:(Integer(
Address'
1'
*'
number:(String(
type:({pers,(prof}(
Phone'
email:(String(
type:({pers,(prof}(
Email'
<<'hasAddress'>>'
<<'hasOtherPhones'>>'
1'
*'
1'
*'
*'
<<'hasMobilePhones'>>'
1'
<<'hasEmail'>>'
MANAGING A POLYGLOT DATABASE
QUERYING, INSERTING, MAINTAINING
33
CRUD OPERATIONS
34
Contact
Group
Content
Post
Id	
   FirstName	
   LastName	
   Society	
  
Consistent view of data
EXAMPLE 1: SYNCHRONIZING REDIS+MYSQL
35
https://oracleus.activeevents.com/connect/sessionDetail.ww?SESSION_ID=4775	
  	
  
Updating REDIS #FAIL
begin	
  MySQL	
  transaction	
  
	
  update	
  MySQL	
  
	
  update	
  Redis	
  
rollback	
  MySQL	
  transaction	
  
begin	
  MySQL	
  transaction	
  
	
  update	
  MySQL	
  
commit	
  MySQL	
  transaction	
  
<<	
  system	
  crashes	
  >>	
  
update	
  Redis	
  
Redis has updated MySQL does not
MySQL has updated Redis does not
EXAMPLE 1: UPDATING REDIS RELIABLY
Step I Step 2
36
begin	
  MySQL	
  transaction	
  
	
  update	
  MySQL	
  
	
  queue	
  CRUD	
  event	
  in	
  MySQL	
  
commit	
  transaction	
  
Event	
  Id	
  
Operation:	
  Create,	
  Update,	
  Delete	
  
queue	
  CRUD	
  event	
  in	
  MySQL	
  
New	
  entity	
  state,	
  e.g.	
  JSON	
  
ACID
for	
  each	
  CRUD	
  event	
  in	
  MySQL	
  queue	
  
	
  
get	
  next	
  CRUD	
  event	
  from	
  MySQL	
  queue	
  
if	
  CRUD	
  event	
  is	
  not	
  duplicate	
  then	
  
	
  update	
  Redis	
  (incl.	
  eventID)	
  
end	
  if	
  
	
  
begin	
  MySQL	
  transaction	
  
	
  mark	
  CRUD	
  event	
  processed	
  
commit	
  transaction	
  
	
  
end	
  for	
  each	
  
EXAMPLE 1: UPDATING REDIS RELIABLY
37
EntityCRUDEvent	
  
Repository	
  
EntityCRUDEvent	
  
Processor	
  
Redis	
  updater	
  
ID	
   JSON	
   Processed?	
  
INSERT	
  INTO	
  ..	
   SELECT	
  …	
  FROM..	
  
apply(event)	
  
Timer	
  
Step 1 Step 2
EXAMPLE 1:TRACKING CHANGES (HIBERNATE)
38
POLYGLOT DATABASE EVOLUTION
¡  Problem statement:
¡  Evolution of the application: modification of classes, new
classes, new relationships among classes
¡  Evolution of the “entities” managed in the polyglot
database
¡  Some change structure, change values, …
¡  The content of the stores start deriving from the
application data structures
¡  Which is the current structure of the entities stored?
¡  Are there elements that are not being accessed because
they do not longer correspond to the application data
structures?
39
Contact'
idContact:"Integer"
firstName:"String"
lastName:"String"
webSite:(URI(
socialNetworkID:(URI((
(
Basic(Info(
Contact'
idContact:(Integer(
lastName:(String(
givenName:(String(
society:(String( <<'hasBasicInfo'>>'
1' *'
postID:(Integer(
timeStamp:(Date(
geoStamp:(String(
Post'
contentID:(Integer(
text:(String(
Content'
<<'hasContent'>>'
<<publishesPost>>'
*'
1'
1' 1'
street:(String,((
number:(Integer,(
City:(Sting,((
Zipcode:(Integer(
Address'
*'
email:(String(
type:({pers,(prof}(
Email'
<<'hasAddress'>>'
*'
1'
<<'hasEmail'>>'
ID:(Integer,(
Lang:(String(
Language'
*'
<<'speaksLanguage'>>'
EXTRACTING SCHEMATA FROM NOSQL PROGRAMS CODE
http://code.google.com/p/exschema/	
  	
  
40
Metalayer(
Struct( Relationship(
Attribute(Set(
*( *(
*(
*(
*( *(
*(
*(*(
*(
*(
*(
*(
*(
Declarations	
  
analyzer	
  
Updates	
  
analyzer	
  
Repositories	
  
analyzer	
  
Schema1 Schema2
Schema3
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.UserInfo
Struct
Attribute
name : userId
Attribute
name : firstName
Attribute
name : lastName
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.Tweet
Struct
Attribute
name : text
Attribute
name : userId
Set
Attribute
implementation : Neo4j
Struct
Attribute
name : fr.imag.twitter.domain.User
Str
Attribute
name : nodeId
Attribute
name : userName
A
na
Relatio
start end
Schema1 Schema2
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.UserInfo
Struct
Attribute
name : userId
Attribute
name : firstName
Attribute
name : lastName
Set
Attribute
implementation : Spring Repository
Attribute
name : fr.imag.twitter.domain.Tweet
Attribute
name : text
Attri
name :
Schema1 Schema2
Schema3
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.UserInfo
Struct
Attribute
name : userId
Attribute
name : firstName
Attribute
name : lastName
Set
Attribute
implementation : Spring Repository
Set
Attribute
name : fr.imag.twitter.domain.Tweet
Struct
Attribute
name : text
Attribute
name : userId
Set
Attribute
implementation : Neo4j
Struct
Attribute
name : fr.imag.twitter.domain.User
Struct
Attribute
name : nodeId
Attribute
name : userName
Attribute
name : userId
Attribute
name : password
Relationship
start end
Attribute
name : followers
Spring Data
EXTRACTING SCHEMATA FROM A POLYGLOT DATABASE
41
Application	
  source	
  code	
  
Metalayer(
Struct( Relationship(
Attribute(Set(
*( *(
*(
*(
*( *(
*(
*(*(
*(
*(
*(
*(
*(
Declarations	
  
Analyzer	
  
Updates	
  
Analyzer	
  
Annotations	
  
Analyzer	
  
Repositories	
  
Analyzer	
  
ExSchema	
  
Variables/Methods	
  Graph	
  Store	
  
Annotations/Variables	
  	
  
Relational	
  &	
  	
  
document	
  stores	
  
AGENDA
ü  Dealing with multiform and multimedia data collections:
ü  it’s all about management requirements
ü  Polyglot persistence general principle
ü  The NoSQL plethora
ü  Polyglot database
ü  Design
ü  Deployment
ü  Querying, update and maintenance
¡  Conclusions and outlook
42
CHALLENGE: POLYGLOT MEETS XPERANTO
Given a data collection coming from different social networks stored on NoSQL systems (Neo4J and Mongo) [possibly
according to a strategy combining sharding and replication techniques], extend the UnQL pivot query language considering
¡  Data processing operators adapted to query different data models (graph, document). Example query Neo + Mongo
and what about Join, Union …
¡  Assuming concurrent CRUD operations to the stores can you expect query results to be consistent ? How can you
tag your results or implement a sharding strategy in order to determine whether results are consistent?
¡  Querying data represented on different models: How can you exploit the structure of the different stores for
expressing queries ? Provide adapted operators? Give generic operators and then rewrite queries?
¡  Normally, Polyglot solutions tend to solve some data processing issues in the application code.This can be penalizing.
Discuss the challenges to address for ensuring that your queries will be able to scale as the collection grows.
43
CHALLENGE: EXPECTED RESULTS
¡  Give the principle of your proposal through a partial programming solution, of the operators of your
UnQL extension, detail the query evaluation process if U want your solution to scale
¡  We ask U to sketch the solution on the polyglot database that we provide consisting of mongo, Neo4J stores
¡  https://github.com/jccastrejon/edbt-unql
¡  Technical requirements:VMware player 5
44
WHEN IS POLYGLOT PERSISTENCE PERTINENT?
¡  Application essentially composing and serving web pages
¡  They only looked up page elements by ID, they had different needs or availability,
concurrency and no need to share all their data
¡  A problem like this is much better suited to a NoSQL store than the corporate relational
DBMS
¡  Scaling to lots of traffic gets harder and harder to do with vertical scaling
¡  Many NoSQL databases are designed to operate over clusters
¡  They can tackle larger volumes of traffic and data than is realistic with a single server
45
CONCLUSIONS
¡  Data are growing big and more heterogeneous and they need new adapted ways to be managed thus the
NoSQL movement is gaining momentum
¡  Data heterogeneity implies different management requirements this is where polyglot persistence comes up
¡  Consistency – Availability – Fault tolerance theorem: find the balance !
¡  Which data store according to its data model?
¡  A lot of programming implied …
46
Open opportunities if you’re interested in this topic!
Contact: GenovevaVargas-Solar, CNRS, LIG-LAFMIA
Genoveva.Vargas@imag.fr	
  
http://www.vargas-­‐solar.com/teaching/	
  	
  	
  	
  
http://code.google.com/p/exschema/	
  	
  
47
http://code.google.com/p/model2roo/	
  	
  
Open source polyglot persistence tools
Prof.	
  Dr.	
  Christine	
  Collet	
  
Grenoble	
  INP	
  
France	
  
Juan	
  Carlos	
  Castrejón	
  
University	
  of	
  Grenoble	
  
France	
  
Javier	
  Espinosa	
  
University	
  of	
  Grenoble	
  
France	
  
Dr.	
  Genoveva	
  Vargas-­‐Solar	
  
CNRS,	
  LIG-­‐LAFMIA	
  
France	
  
REFERENCES
¡  Eric	
  A.,	
  Brewer	
  "Towards	
  robust	
  distributed	
  systems."	
  PODC.	
  2000	
  
¡  Rick,	
  Cattell	
  "Scalable	
  SQL	
  and	
  NoSQL	
  data	
  stores."	
  ACM	
  SIGMOD	
  Record	
  39.4	
  (2011):	
  12-­‐27	
  
¡  Juan	
   Castrejon,	
   Genoveva	
   Vargas-­‐Solar,	
   Christine	
   Collet,	
   and	
   Rafael	
   Lozano,	
   ExSchema:	
  
Discovering	
  and	
  Maintaining	
  Schemas	
  from	
  Polyglot	
  Persistence	
  Applications,	
  In	
  Proceedings	
  of	
  
the	
  International	
  Conference	
  on	
  Software	
  Maintenance,	
  Demo	
  Paper,	
  IEEE,	
  2013	
  	
  
¡  M.	
  Fowler	
  and	
  P.	
  Sadalage.	
  NoSQL	
  Distilled:	
  A	
  Brief	
  Guide	
  to	
  the	
  Emerging	
  World	
  of	
  Polyglot	
  
Persistence.	
  Pearson	
  Education,	
  Limited,	
  2012	
  
¡  C.	
   Richardson,	
   Developing	
   polyglot	
   persistence	
   applications,	
   http://fr.slideshare.net/
chris.e.richardson/developing-­‐polyglotpersistenceapplications-­‐gluecon2013	
  
49
NOSQL PLETHORA
50
DATA MODELS
¡  Tuple
¡  Row in a relational table, where attributes are pre-defined in a schema, and the values are scalar
¡  Document
¡  Allows values to be nested documents or lists, as well as scalar values.
¡  Attributes are not defined in a global schema
¡  Extensible record
¡  Hybrid between tuple and document, where families of attributes are defined in a schema, but new attributes can be added
on a per-record basis
51
DATA STORES
¡  Key-value
¡  Systems that store values and an index to find them, based on a key
¡  Document
¡  Systems that store documents, providing index and simple query mechanisms
¡  Extensible record
¡  Systems that store extensible records that can be partitioned vertically and horizontally across nodes
¡  Graph
¡  Systems that store model data as graphs where nodes can represent content modelled as document or key-value structures and arcs
represent a relation between the data modelled by the node
¡  Relational
¡  Systems that store, index and query tuples
52
KEY-VALUE STORES
¡  “Simplest data stores” use a data model similar to
the memcached distributed in-memory cache
¡  Single key-value index for all data
¡  Provide a persistence mechanism
¡  Replication, versioning, locking, transactions, sorting
¡  API: inserts, deletes, index lookups
¡  No secondary indices or keys
53
SYSTEM ADDRESS
Redis code.google.com/p/redis	
  
Scalaris code.google.com/p/scalaris	
  
Tokyo tokyocabinet.sourceforge.net	
  
Voldemort project-­‐voldemort.com	
  
Riak riak.basho.com	
  
Membrain schoonerinfotech.com/products	
  
Membase membase.com	
  
SELECT	
  	
  name	
  
FROM	
  	
  	
  	
  group	
  
WHERE	
  	
  	
  gid	
  IN	
  (	
  SELECT	
  	
  gid	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  FROM	
  	
  	
  	
  group_member	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  WHERE	
  	
  	
  uid	
  =	
  me()	
  )	
  
54
SELECT	
  	
  name,	
  pic,	
  profile_url	
  
FROM	
  	
  	
  	
  user	
  
WHERE	
  	
  	
  uid	
  =	
  me()	
  
SELECT	
  	
  name,	
  pic	
  
FROM	
  	
  	
  	
  user	
  
WHERE	
  	
  	
  online_presence	
  =	
  "active"	
  	
  
	
  	
  	
  	
  	
  	
  	
  	
  AND	
  
	
  	
  	
  	
  	
  	
  	
  	
  uid	
  IN	
  (	
  SELECT	
  	
  uid2	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  FROM	
  	
  	
  	
  friend	
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  WHERE	
  	
  	
  uid1	
  =	
  me()	
  )	
  
SELECT	
  	
  name	
  
FROM	
  	
  	
  	
  friendlist	
  
WHERE	
  	
  	
  owner	
  =	
  me()	
  
SELECT	
  	
  message,	
  attachment	
  
FROM	
  	
  	
  	
  stream	
  
WHERE	
  	
  	
  source_id	
  =	
  me()	
  AND	
  type	
  =	
  80	
  
https://developers.facebook.com/docs/reference/fql/	
  
55
<805114856, 	
   	
  	
  	
  	
  	
  	
  	
  
>
DOCUMENT STORES
¡  Support more complex data: pointerless objects, i.e.,
documents
¡  Secondary indexes, multiple types of documents
(objects) per database, nested documents and lists, e.g.
B-trees
¡  Automatic sharding (scale writes), no explicit locks,
weaker concurrency (eventual for scaling reads) and
atomicity properties
¡  API: select,	
  delete,	
  getAttributes,	
  
putAttributes	
  on documents
¡  Queries can be distributed in parallel over multiple
nodes using a map-reduce mechanism
56
SYSTEM ADDRESS
SimpleDB amazon.com/simpledb	
  
Couch DB couchdb.apache.org	
  
Mongo DB mongodb.org	
  
Terrastore code.google.com/terrastore	
  
57
DOCUMENT STORES
EXTENSIBLE RECORD STORES
¡  Basic data model is rows and columns
¡  Basic scalability model is splitting rows and columns over
multiple nodes
¡  Rows split across nodes through sharding on the primary key
¡  Split by range rather than hash function
¡  Rows analogous to documents: variable number of attributes, attribute
names must be unique
¡  Grouped into collections (tables)
¡  Queries on ranges of values do not go to every node
¡  Columns are distributed over multiple nodes using “column
groups”
¡  Which columns are best stored together
¡  Column groups must be pre-defined with the extensible record
stores
58
SYSTEM ADDRESS
HBase hbase.apache.com	
  
HyperTable hypertable.org	
  
Cassandra incubator.apache.org/cassandra	
  
SCALABLE RELATIONAL SYSTEMS
¡  SQL: rich declarative query language
¡  Databases reinforce referential integrity
¡  ACID semantics
¡  Well understood operations:
¡  Configuration, Care and feeding, Backups,Tuning, Failure and recovery,
Performance characteristics
¡  Use small-scope operations
¡  Challenge: joins that do not scale with sharding
¡  Use small-scope transactions
¡  ACID transactions inefficient with communication and 2PC overhead
¡  Shared nothing architecture for scalability
¡  Avoid cross-node operations
59
SYSTEM ADDRESS
MySQL C mysql.com/cluster	
  
Volt DB voltdb.com	
  
Clustrix clustrix.com	
  
ScaleDB scaledb.com	
  
Scale Base scalebase.com	
  
Nimbus DB nimbusdb.com	
  
NOSQL STORES:AVAILABILITY AND PERFORMANCE
60
REPLICATION MASTER - SLAVE
¡  Makes one node the authoritative copy/replica that handles
writes while replica synchronize with the master and may
handle reeds
¡  All replicas have the same weight
¡  Replicas can all accept writes
¡  The lose of one of them does not prevent access to the data store
¡  Helps with read scalability but does not help with write
scalability
¡  Read resilience: should the master fail, slaves can still handle
read requests
¡  Master failure eliminates the ability to handle writes until
either the master is restored or a new master is appointed
¡  Biggest complication is consistency
¡  Possible write – write conflict
¡  Attempt to update the same record at the same time from to
different places
¡  Master is a bottle-neck and a point of failure
61
Master'
Slaves'
all'updates'
made'to'the'master'
changes'propagate''
To'slaves'
reads'can'be'done'
from'master'or'slaves'
MASTER-SLAVE REPLICATION MANAGEMENT
¡  Masters can be appointed
¡  Manually when configuring the nodes cluster
¡  Automatically: when configuring a nodes cluster one of them elected as master.The master can appoint a new master when the master
fails reducing downtime
¡  Read resilience
¡  Read and write paths have to be managed separately to handle failure in the write path and still reads can occur
¡  Reads and writes are put in different database connections if the database library accepts it
¡  Replication comes inevitably with a dark side: inconsistency
¡  Different clients reading different slaves will see different values if changes have not been propagated to all slaves
¡  In the worst case a client cannot read a write it just made
¡  Even if master-slave is used for hot backups, if the master fails any updates on to the backup are lost
62
REPLICATION: PEER-TO-PEER
¡  Allows writes to any node; the nodes coordinate
to synchronize their copies
¡  The replicas have equal weight
¡  Deals with inconsistencies
¡  Replicas coordinate to avoid conflict
¡  Network traffic cost for coordinating writes
¡  Unnecessary to make all replicas agree to write, only
the majority
¡  Survival to the loss of the minority of replicas nodes
¡  Policy to merge inconsistent writes
¡  Full performance on writing to any replica
63
Master'
nodes'communicate'
their'writes'
all'nodes'read'
and'write'all'data'
SHARDING
¡  Ability to distribute both data and load of simple
operations over many servers, with no RAM or disk
shared among servers
¡  A way to horizontally scale writes
¡  Improve read performance
¡  Application/data store support
64
¡  Puts different data on separate nodes
¡  Each user only talks to one servicer so she gets rapid
responses
¡  The load should be balanced out nicely between
servers
¡  Ensure that
¡  data that is accessed together is clumped together on
the same node
¡  that clumps are arranged on the nodes to provide best
data access
Each%shard%reads%and%
writes%its%own%data%
SHARDING
Database laws
¡  Small databases are fast
¡  Big databases are slow
¡  Keep databases small
Principle
¡  Start with a big monolithic database
¡  Break into smaller databases
¡  Across many clusters
¡  Using a key value
65
Instead of having one million customers information
on a single big machine ….
100 000 customers on smaller and different machines
SHARDING CRITERIA
¡  Partitioning
¡  Relational: handled by the DBMS (homogeneous DBMS)
¡  NoSQL: based on ranging of the k-value
¡  Federation
¡  Relational
¡  Combine tables stored in different physical databases
¡  Easier with denormalized data
¡  NoSQL:
¡  Store together data that are accessed together
¡  Aggregates unit of distribution
66
SHARDING
Architecture
¡  Each application server (AS) is running DBS/client
¡  Each shard server is running
¡  a database server
¡  replication agents and query agents for supporting
parallel query functionality
Process
¡  Pick a dimension that helps sharding easily (customers,
countries, addresses)
¡  Pick strategies that will last a long time as repartition/
re-sharding of data is operationally difficult
¡  This is done according to two different principles
¡  Partitioning: a partition is a structure that divides a space
into tow parts
¡  Federation: a set of things that together compose a
centralized unit but each individually maintains some aspect
of autonomy
67
Customers data is partitioned by ID in shards using an
algorithm d to determine which shard a customer ID belongs to
REPLICATION:ASPECTS TO CONSIDER
¡  Conditioning
¡  Important elements to consider
¡  Data to duplicate
¡  Copies location
¡  Duplication model (master – slave / P2P)
¡  Consistency model (global – copies)
68
Fault	
  
tolerance	
  
Availability	
  
Transparency	
  
levels	
  
Performance	
  
à Find a compromise !
PARTITIONING
A PARTITION IS A STRUCTURE THAT DIVIDES A SPACE INTO TOW PARTS
69
BACKGROUND: DISTRIBUTED RELATIONAL DATABASES
¡  External schemas (views) are often subsets of relations
(contacts in Europe and America)
¡  Access defined on subsets of relations: 80% of the
queries issued in a region have to do with contacts of
that region
¡  Relations partition
¡  Better concurrency level
¡  Fragments accessed independently
¡  Implications
¡  Check integrity constraints
¡  Rebuild relations
70
FRAGMENTATION
¡  Horizontal
¡  Groups of tuples of the same relation
¡  Budget < 300 000 or >= 150 000
¡  Not disjoint are more difficult to manage
¡  Vertical
¡  Groups attributes of the same relation
¡  Separate budget from loc and pname of the relation
project
¡  Hybrid
71
FRAGMENTATION: RULES
Vertical
¡  Clustering
¡  Grouping elementary fragments
¡  Budget and location information in two relations
¡  Splitting
¡  Decomposing a relation according to affinity
relationships among attributes
Horizontal
¡  Tuples of the same fragment must be statistically
homogeneous
¡  If t1 and t2 are tuples of the same fragment then t1 and t2 have
the same probability of being selected by a query
¡  Keep important conditions
¡  Complete
¡  Every tuple (attribute) belongs to a fragment (without information
loss)
¡  If tuples where budget >= 150 000 are more likely to be selected
then it is a good candidate
¡  Minimum
¡  If no application distinguishes between budget >= 150 000 and
budget < 150 000 then these conditions are unnecessary
72
SHARDING: HORIZONTAL PARTITIONING
¡  The entities of a database are split into two or
more sets (by row)
¡  In relational: same schema several physical bases/
servers
¡  Partition contacts in Europe and America shards where
they zip code indicates where the will be found
¡  Efficient if there exists some robust and implicit way to
identify in which partition to find a particular entity
¡  Last resort shard
¡  Needs to find a sharding function: modulo, round robin,
hash – partition, range - partition
73
Load%balancer%
Cache%1%
Cache%2%
Cache%3%
MySQL%
Master%
MySQL%
Master%
Web%1%
Web%2%
Web%3%
Even%IDs%
Odd%IDs%
MySQL%
Slave%1% MySQL%
Slave%2%
MySQL%
Slave%n%
MySQL%
Slave%1% MySQL%
Slave%2%
MySQL%
Slave%n%
FEDERATION
A FEDERATION IS A SET OF THINGS THAT TOGETHER COMPOSE A CENTRALIZED UNIT BUT EACH
INDIVIDUALLY MAINTAINS SOME ASPECT OF AUTONOMY
74
FEDERATION:VERTICAL SHARDING
¡  Principle
¡  Partition data according to their logical affiliation
¡  Put together data that are commonly accessed
¡  The search load for the large partitioned entity can
be split across multiple servers (logical and
physical) and not only according to multiple
indexes in the same logical server
¡  Different schemas, systems, and physical bases/
servers
¡  Shards the components of a site and not only data
75
Load%balancer%
Cache%1%
Cache%2%
Cache%3%
MySQL%
Master%
MySQL%
Master%
Web%1%
Web%2%
Web%3%
Site%database%
Resume%database%
MySQL%
Slave%1% MySQL%
Slave%2%
MySQL%
Slave%n%
MySQL%
Slave%1%
Internal%
user%
NOSQL STORES: PERSISTENCY MANAGEMENT
76
«MEMCACHED»
¡  «memcached» is a memory management protocol based on a cache:
¡  Uses the key-value notion
¡  Information is completly stored in RAM
¡  «memcached» protocol for:
¡  Creating, retrieving, updating, and deleting information from the database
¡  Applications with their own «memcached» manager (Google, Facebook,
YouTube, FarmVille,Twitter,Wikipedia)
77
STORAGE ON DISC (1)
¡  For efficiency reasons, information is stored using the RAM:
¡  Work information is in RAM in order to answer to low latency requests
¡  Yet, this is not always possible and desirable
Ø  The process of moving data from RAM to disc is called "eviction”; this
process is configured automatically for every bucket
78
STORAGE ON DISC (2)
¡  NoSQL servers support the storage of key-value pairs on disc:
¡  Persistency–can be executed by loading data, closing and reinitializing it
without having to load data from another source
¡  Hot backups– loaded data are sotred on disc so that it can be
reinitialized in case of failures
¡  Storage on disc– the disc is used when the quantity of data is higher
thant the physical size of the RAM, frequently used information is
maintained in RAM and the rest es stored on disc
79
STORAGE ON DISC (3)
¡  Strategies for ensuring:
¡  Each node maintains in RAM information on the key-value pairs it stores. Keys:
¡  may not be found, or
¡  they can be stored in memory or on disc
¡  The process of moving information from RAM to disc is asynchronous:
¡  The server can continue processing new requests
¡  A queue manages requests to disc
Ø  In periods with a lot of writing requests, clients can be notified that the server is
termporaly out of memory until information is evicted
80
NOSQL STORES: CONCURRENCY CONTROL
81
MULTIVERSION CONCURRENCY CONTROL (MVCC)
¡  Objective: Provide concurrent access to the database and in programming languages to implement transactional
memory
¡  Problem: If someone is reading from a database at the same time as someone else is writing to it, the reader could
see a half-written or inconsistent piece of data.
¡  Lock: readers wait until the writer is done
¡  MVCC:
¡  Each user connected to the database sees a snapshot of the database at a particular instant in time
¡  Any changes made by a writer will not be seen by other users until the changes have been completed (until the transaction has been
committed
¡  When an MVCC database needs to update an item of data it marks the old data as obsolete and adds the newer version elsewhere à
multiple versions stored, but only one is the latest
¡  Writes can be isolated by virtue of the old versions being maintained
¡  Requires (generally) the system to periodically sweep through and delete the old, obsolete data objects
82

More Related Content

What's hot

NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introductionPooyan Mehrparvar
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureVenu Anuganti
 
NoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture PatternsNoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture PatternsDATAVERSITY
 
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web ApplicationsWhat Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web ApplicationsTodd Hoff
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databasesJames Serra
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQLDon Demcsak
 
cassandra
cassandracassandra
cassandraAkash R
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analyticsjoshwills
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvewKunal Khanna
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL David Smelker
 
http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151xlight
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...DataStax
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرdatastack
 

What's hot (19)

NoSQL databases - An introduction
NoSQL databases - An introductionNoSQL databases - An introduction
NoSQL databases - An introduction
 
Big data rmoug
Big data rmougBig data rmoug
Big data rmoug
 
NoSQL and Couchbase
NoSQL and CouchbaseNoSQL and Couchbase
NoSQL and Couchbase
 
SQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data ArchitectureSQL, NoSQL, BigData in Data Architecture
SQL, NoSQL, BigData in Data Architecture
 
NoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture PatternsNoSQL Now! NoSQL Architecture Patterns
NoSQL Now! NoSQL Architecture Patterns
 
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web ApplicationsWhat Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
What Should I Do? Choosing SQL, NoSQL or Both for Scalable Web Applications
 
Relational databases vs Non-relational databases
Relational databases vs Non-relational databasesRelational databases vs Non-relational databases
Relational databases vs Non-relational databases
 
ECX Solution Sheet
ECX Solution SheetECX Solution Sheet
ECX Solution Sheet
 
Intro to Big Data and NoSQL
Intro to Big Data and NoSQLIntro to Big Data and NoSQL
Intro to Big Data and NoSQL
 
Nosql databases
Nosql databasesNosql databases
Nosql databases
 
RDBMS vs NoSQL
RDBMS vs NoSQLRDBMS vs NoSQL
RDBMS vs NoSQL
 
cassandra
cassandracassandra
cassandra
 
Hadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced AnalyticsHadoop vs. RDBMS for Advanced Analytics
Hadoop vs. RDBMS for Advanced Analytics
 
Big data and hadoop overvew
Big data and hadoop overvewBig data and hadoop overvew
Big data and hadoop overvew
 
NOSQL
NOSQLNOSQL
NOSQL
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL
 
http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151http://www.hfadeel.com/Blog/?p=151
http://www.hfadeel.com/Blog/?p=151
 
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
Webinar: ROI on Big Data - RDBMS, NoSQL or Both? A Simple Guide for Knowing H...
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
 

Similar to Vargas polyglot-persistence-cloud-edbt

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothAdaryl "Bob" Wakefield, MBA
 
Big Data 2107 for Ribbon
Big Data 2107 for RibbonBig Data 2107 for Ribbon
Big Data 2107 for RibbonSamuel Dratwa
 
مقدمة عن NoSQL بالعربي
مقدمة عن NoSQL بالعربيمقدمة عن NoSQL بالعربي
مقدمة عن NoSQL بالعربيMohamed Galal
 
Building High Performance MySQL Query Systems and Analytic Applications
Building High Performance MySQL Query Systems and Analytic ApplicationsBuilding High Performance MySQL Query Systems and Analytic Applications
Building High Performance MySQL Query Systems and Analytic ApplicationsCalpont
 
Building High Performance MySql Query Systems And Analytic Applications
Building High Performance MySql Query Systems And Analytic ApplicationsBuilding High Performance MySql Query Systems And Analytic Applications
Building High Performance MySql Query Systems And Analytic Applicationsguest40cda0b
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overviewABC Talks
 
Nosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxNosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxRadhika R
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Martin Bém
 
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016StampedeCon
 
No sql databases
No sql databases No sql databases
No sql databases Ankit Dubey
 
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...In-Memory Computing Summit
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperDavid Walker
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...IJCERT JOURNAL
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcturesabnees
 
Best-Practices-for-Using-Tableau-With-Snowflake.pdf
Best-Practices-for-Using-Tableau-With-Snowflake.pdfBest-Practices-for-Using-Tableau-With-Snowflake.pdf
Best-Practices-for-Using-Tableau-With-Snowflake.pdfssuserf8f9b2
 

Similar to Vargas polyglot-persistence-cloud-edbt (20)

The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need BothThe Marriage of the Data Lake and the Data Warehouse and Why You Need Both
The Marriage of the Data Lake and the Data Warehouse and Why You Need Both
 
Big Data 2107 for Ribbon
Big Data 2107 for RibbonBig Data 2107 for Ribbon
Big Data 2107 for Ribbon
 
مقدمة عن NoSQL بالعربي
مقدمة عن NoSQL بالعربيمقدمة عن NoSQL بالعربي
مقدمة عن NoSQL بالعربي
 
Building High Performance MySQL Query Systems and Analytic Applications
Building High Performance MySQL Query Systems and Analytic ApplicationsBuilding High Performance MySQL Query Systems and Analytic Applications
Building High Performance MySQL Query Systems and Analytic Applications
 
Building High Performance MySql Query Systems And Analytic Applications
Building High Performance MySql Query Systems And Analytic ApplicationsBuilding High Performance MySql Query Systems And Analytic Applications
Building High Performance MySql Query Systems And Analytic Applications
 
Elastic search overview
Elastic search overviewElastic search overview
Elastic search overview
 
Nosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptxNosql-Module 1 PPT.pptx
Nosql-Module 1 PPT.pptx
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Big Data NoSQL 1017
Big Data NoSQL 1017Big Data NoSQL 1017
Big Data NoSQL 1017
 
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
Best Practices For Building and Operating A Managed Data Lake - StampedeCon 2016
 
No sql databases
No sql databases No sql databases
No sql databases
 
Kafka & Hadoop in Rakuten
Kafka & Hadoop in RakutenKafka & Hadoop in Rakuten
Kafka & Hadoop in Rakuten
 
The NoSQL Movement
The NoSQL MovementThe NoSQL Movement
The NoSQL Movement
 
NoSQL Basics - A Quick Tour
NoSQL Basics - A Quick TourNoSQL Basics - A Quick Tour
NoSQL Basics - A Quick Tour
 
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
IMC Summit 2016 Breakout - Pandurang Naik - Demystifying In-Memory Data Grid,...
 
EOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - PaperEOUG95 - Client Server Very Large Databases - Paper
EOUG95 - Client Server Very Large Databases - Paper
 
NoSQL Basics - a quick tour
NoSQL Basics - a quick tourNoSQL Basics - a quick tour
NoSQL Basics - a quick tour
 
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
SURVEY ON IMPLEMANTATION OF COLUMN ORIENTED NOSQL DATA STORES ( BIGTABLE & CA...
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
 
Best-Practices-for-Using-Tableau-With-Snowflake.pdf
Best-Practices-for-Using-Tableau-With-Snowflake.pdfBest-Practices-for-Using-Tableau-With-Snowflake.pdf
Best-Practices-for-Using-Tableau-With-Snowflake.pdf
 

More from Genoveva Vargas-Solar

Talk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial IntelligenceTalk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial IntelligenceGenoveva Vargas-Solar
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYGenoveva Vargas-Solar
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYGenoveva Vargas-Solar
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPCGenoveva Vargas-Solar
 

More from Genoveva Vargas-Solar (10)

Aiccsa 2021-w-stem
Aiccsa 2021-w-stemAiccsa 2021-w-stem
Aiccsa 2021-w-stem
 
Talk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial IntelligenceTalk straps: Interactivity between Human and Artificial Intelligence
Talk straps: Interactivity between Human and Artificial Intelligence
 
Data w-steamm
Data w-steammData w-steamm
Data w-steamm
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
 
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEYFROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
FROM GRADUATE SCHOOL TO PROFESSIONAL LIFE PREPARING A LONG JOURNEY
 
Moving forward data centric sciences weaving AI, Big Data & HPC
Moving forward data centric sciences  weaving AI, Big Data & HPCMoving forward data centric sciences  weaving AI, Big Data & HPC
Moving forward data centric sciences weaving AI, Big Data & HPC
 
3 map reduce perspectives
3 map reduce perspectives3 map reduce perspectives
3 map reduce perspectives
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
 
1 mapreduce-fest
1 mapreduce-fest1 mapreduce-fest
1 mapreduce-fest
 
Addressing dm-cloud
Addressing dm-cloudAddressing dm-cloud
Addressing dm-cloud
 

Recently uploaded

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 

Recently uploaded (20)

MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUKSpring Boot vs Quarkus the ultimate battle - DevoxxUK
Spring Boot vs Quarkus the ultimate battle - DevoxxUK
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 

Vargas polyglot-persistence-cloud-edbt

  • 1. POLYGLOT PERSISTENCE AND MULTI-CLOUD DATA MANAGEMENT SOLUTIONS GENOVEVAVARGAS SOLAR FRENCH COUNCIL OF SCIENTIFIC RESEARCH, LIG-LAFMIA, FRANCE Genoveva.Vargas@imag.fr     http://www.vargas-­‐solar.com  
  • 2. DATA ALL AROUND 2 30 billion pieces of content are shared on Facebook every month 40 billion+ hours video are watched onYouTube each month As of 2011 the global size Data in healthcare was estimated to be 150 Exabytes (161 billion of Gigabytes) 40 millionTweets are sent per day about 200 monthly active users By 2014 it is anticipated there will be 400 million wearable wireless health monitors Web 2.0 sites where millions of users may both read and write data, scalability for simple database operations has become important Data collections available through front ends managed through public/private organization, available through the Web, e.g., Sloan Sky Server
  • 3. STORING AND ACCESSING HUGE AMOUNTS OF DATA Peta  1015   Exa  1018   Zetta  1021   Yota  1024   RAID   Disk   Cloud   3 •  Data formats •  Data collection sizes •  Data storage supports •  Data delivery mechanisms
  • 4. 4
  • 5. THIS TALK IS NOT ABOUT 5 http://nosql-­‐database.org   Debate on whether NoSQL stores and relational systems are better or worse … that is not the point Of course we can surf on these waves at the end of the talk and during EDBT School!
  • 6. THIS TALK IS ABOUT 6 alternative for managing multiform and multimedia data collections according to different properties and requirements
  • 7. AGENDA ¡  Dealing with multiform and multimedia data collections: it’s all about management requirements ¡  Polyglot persistence general principle ¡  The NoSQL plethora ¡  Polyglot database ¡  Design ¡  Deployment ¡  Querying, update and maintenance ¡  Conclusions and outlook 7
  • 8. 8
  • 9. POLYGLOT PERSISTENCE ¡  Polyglot Programming: applications should be written in a mix of languages to take advantage of different languages are suitable for tackling different problems ¡  Polyglot persistence: any decent sized enterprise will have a variety of different data storage technologies for different kinds of data ¡  a new strategic enterprise application should no longer be built assuming a relational persistence support ¡  the relational option might be the right one - but you should seriously look at other alternatives 9 M.  Fowler  and  P.  Sadalage.  NoSQL  Distilled:  A  Brief  Guide  to  the  Emerging  World  of  Polyglot  Persistence.  Pearson  Education,  Limited,  2012  
  • 10. (Katsov-2012) Use the right tool for the right job… How do I know which is the right tool for the right job? 10
  • 11. SCALING DATABASE SYSTEMS ¡  A system is scalable if increasing its resources (CPU, memory, disk) results in increased performance proportionally to the added resources ¡  Improving performance means serving more units of work for handling larger units of work like when data sets grow ¡  Database systems have been scaled by buying bigger faster and more expensive machines 11 ¡  Vertically (SCALE UP) ¡  Add resources (CPU, memory) to a single node in a system ¡  Horizontally (SCALE OUT) ¡  Add more nodes to a system
  • 12. NOSQL STORES CHARACTERISTICS ¡  Simple operations ¡  Key lookups reads and writes of one record or a small number of records ¡  No complex queries or joins ¡  Ability to dynamically add new attributes to data records ¡  Horizontal scalability ¡  Distribute data and operations over many servers ¡  Replicate and distribute data over many servers ¡  No shared memory or disk ¡  High performance ¡  Efficient use of distributed indexes and RAM for data storage ¡  Weak consistency model ¡  Limited transactions 12 Next generation databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable [http://nosql-­‐database.org]  
  • 13. DEALING WITH HUGE AMOUNTS OF DATA 13 Peta  1015   Exa  1018   Zetta  1021   Yota  1024   RAID   Disk   Cloud   Concurrency   Consistency   Atomicity   Relational   Graph   Key  value   Columns  
  • 14. DATA MANAGEMENT SYSTEMS ARCHITECTURES Physical  model   Logic  model   External  model   ANSI/SPARC   Storage   Manager   Schema   Manager   Query   Engine   Transaction   Manager   DBMS   Customisable  points   Custom  components   Glue   code   Data     services   Access   services   Storage   services   Additional   Extension   services   Other   services   Extension  services   Streaming, XML, procedures, queries, replication Physical  model   Logic  model   External  model   Physical  model   Logic  model   External  model   14
  • 15. 15 Data  stores  designed    to  scale  simple     OLTP-­‐style  applica7on  loads     •  Data  model     •  Consistency     •  Storage     •  Durability     •  Availability     •  Query  support   Read/Write  operations     by  thousands/millions  of  users  
  • 16. PROBLEM STATEMENT: HOW MUCH TO GIVE UP? ¡  CAP theorem1: a system can have two of the three properties ¡  NoSQL systems sacrifice consistency 16 Consistency  Availability   Fault-­‐tolerant     partitioning   1  Eric  Brewer,  "Towards  robust  distributed  systems."  PODC.  2000  http://www.cs.berkeley.edu/~brewer/cs262b-­‐2004/PODC-­‐keynote.pdf    
  • 17. NOSQL STORES:AVAILABILITY AND PERFORMANCE ¡  Replication ¡  Copy data across multiple servers (each bit of data can be found in multiple servers) ¡  Increase data availability ¡  Faster query evaluation ¡  Sharding ¡  Distribute different data across multiple servers ¡  Each server acts as the single source of a data subset ¡  Orthogonal techniques 17
  • 18. REPLICATION: PROS & CONS ¡  Data is more available ¡  Failure of a site containing E does not result in unavailability of E if replicas exist ¡  Performance ¡  Parallelism: queries processed in parallel on several nodes ¡  Reduce data transfer for local data ¡  Increased updates cost ¡  Synchronisation: each replica must be updated ¡  Increased complexity of concurrency control ¡  Concurrent updates to distinct replicas may lead to inconsistent data unless special concurrency control mechanisms are implemented 18
  • 19. SHARDING:WHY IS IT USEFUL? ¡  Scaling applications by reducing data sets in any single databases ¡  Segregating data ¡  Sharing application data ¡  Securing sensitive data by isolating it ¡  Improve read and write performance ¡  Smaller amount of data in each user group implies faster querying ¡  Isolating data into smaller shards accessed data is more likely to stay on cache ¡  More write bandwidth: writing can be done in parallel ¡  Smaller data sets are easier to backup, restore and manage ¡  Massively work done ¡  Parallel work: scale out across more nodes ¡  Parallel backend: handling higher user loads ¡  Share nothing: very few bottlenecks ¡  Decrease resilience improve availability ¡  If a box goes down others still operate ¡  But: Part of the data missing 19 Load%balancer% Cache%1% Cache%2% Cache%3% MySQL% Master% MySQL% Master% Web%1% Web%2% Web%3% Site%database% Resume%database%
  • 20. SHARDING AND REPLICATION ¡  Sharding with no replication: unique copy, distributed data sets ¡  (+) Better concurrency levels (shards are accessed independently) ¡  (-) Cost of checking constraints, rebuilding aggregates ¡  Ensure that queries and updates are distributed across shards ¡  Replication of shards ¡  (+) Query performance (availability) ¡  (-) Cost of updating, of checking constraints, complexity of concurrency control ¡  Partial replication (most of the times) ¡  Only some shards are duplicated 20
  • 21. NOSQL STORES: DATA MANAGEMENT PROPERTIES ¡  Indexing ¡  Distributed hashing like Memcached open source cache ¡  In-memory indexes are scalable when distributing and replicating objects over multiple nodes ¡  Partitioned tables ¡  High availability and scalability: eventual consistency ¡  Data fetched are not guaranteed to be up-to-date ¡  Updates are guaranteed to be propagated to all nodes eventually ¡  Shared nothing horizontal scaling ¡  Replicating and partitioning data over many servers ¡  Support large number of simple read/write operations per second (OLTP) ¡  No ACID guarantees ¡  Updates eventually propagated but limited guarantees on reads consistency ¡  BASE: basically available; soft state, eventually consistent ¡  Multi-version concurrency control 21
  • 22. COMPARING NOSQL & NEWSQL SYSTEMS SYSTEM CONCURRENCY CONTROL DATA STORAGE REPLICATION TRANSACTION Redis Locks RAM Asynchronous No Scalaris Locks RAM Synchronous Local Tokyo Locks RAM/Disk Asynchronous Local Voldemort MVCC RAM/BDB Asynchronous No Riak MVCC Plug in Asynchronous No Membrain Locks Flash+Disk Synchronous Local Membase Locks Disk Synchronous Local Dynamo MVCC Plug in Asynchronous No SimpleDB Non S3 Asynchronous No MongoDB Locks Disk Asynchronous No CouchDB MVCC Disk Asynchronous No 22 SYSTEM CONCURRENCY CONTROL DATA STORAGE REPLICATION TRANSACTION Terrastore Locks RAM+ Synchronous L Hbase Locks HADOOP Asynchronous L HyperTable Locks Files Synchronous L Cassandra MVCC Disk Asynchronous L BigTable Locs+stamps GFS Both L PNuts MVCC Disk Asynchronous L MySQL-C ACID Disk Synchronous Y VoltDB ACID/no Lock RAM Synchronous Y Clustrix ACID/no Lock Disk Synchronous Y ScaleDB ACID Disk Synchronous Y ScaleBase ACID Disk Asynchronous Y NimbusDB ACID/no Lock Disk Synchronous Y Key-­‐Value  Document   Extended  records  Relational   Cattell,  Rick.  "Scalable  SQL  and  NoSQL  data  stores."  ACM  SIGMOD  Record  39.4  (2011):  12-­‐27  
  • 23. AGENDA ü  Dealing with multiform and multimedia data collections: ü  it’s all about management requirements ü  Polyglot persistence general principle ü  The NoSQL plethora ¡  Polyglot database ¡  Design ¡  Deployment ¡  Querying, update and maintenance ¡  Conclusions and outlook 23
  • 24. DESIGNING AND BUILDING A POLYGLOT DATABASE 24
  • 25. 25 OBJECTIVE ¡  Build a MyNet app based on a polyglot database for building an integrated directory of my contacts including their status and posts from several social networks MyNet  DB  MyNet  App   Social  network  
  • 26. 26 Analysis on contacts networks, overlapping according to interests, posts topics Top 10 most popular contacts User sessions in different Social networks Integrating posts from all networks Contact graph traversal For building groups out of Common characteristics Synchronizing posts to all SNFriends network User accounts activity In different social networks Directory synchronisation Integrating contacts’ information From all SN
  • 27. NOSQL DESIGN AND CONSTRUCTION PROCESS ¡  Data reside in RAM (memcached) and is eventually replicated and stored ¡  Querying = designing a database according to the type of queries / map reduce model ¡  “On demand” data management: the database is virtually organized per view (external schema) on cache and some view are made persistent ¡  An elastic easy to evolve and explicitly configurable architecture 27 Database querying Database population Database organization INDEX   Memcached Replicated Stored
  • 28. GENERATING NOSQL PROGRAMS FROM HIGH LEVEL ABSTRACTIONS 28 UML class diagram application classes Spring Roo Java   web   App   Spring Data Graph database Relational database High-level abstractions Low-level abstractions http://code.google.com/p/model2roo/    
  • 29. DEPLOYING A POLYGLOT DATABASE 29
  • 30. DEPLOYING A DATABASE APPLICATION ON THE CLOUD Have a first experience on ¡  creating a database on a relational DBMS deployed on the cloud, ¡  building a simple database application exported as a service, ¡  deploying the service on the cloud and implement an interface. 30 MyNet  DB   MyNet  App   Contact' idContact:"Integer" firstName:"String" lastName:"String" Service   App  
  • 31. SHARDING AND DEPLOYING SHARDS ON THE CLOUD Have a first experience on ¡  sharding a relational DBMS ¡  dealing with shards synchronization ¡  deploying the service on the cloud and implement an interface 31 MyNet  DB   MyNet   Service   webSite:(URI( socialNetworkID:(URI(( ( Basic(Info( Contact' idContact:(Integer( lastName:(String( givenName:(String( society:(String( <<'hasBasicInfo'>>' 1' *' postID:(Integer( timeStamp:(Date( geoStamp:(String( Post' contentID:(Integer( text:(String( Content' <<'hasContent'>>' <<publishesPost>>' *' 1' 1' 1' street:(String,(( number:(Integer,( City:(Sting,(( Zipcode:(Integer( Address' *' email:(String( type:({pers,(prof}( Email' <<'hasAddress'>>' *' 1' <<'hasEmail'>>' ID:(Integer,( Lang:(String( Language' *' <<'speaksLanguage'>>' Français   Español   English   Others  
  • 32. REST   JSON  documents   MyNet     MULTI-CLOUD POLYGLOT DATABASE MyNetContacts   webSite:(URI( socialNetworkID:(URI(( ( Basic(Info( Contact' idContact:(Integer( lastName:(String( givenName:(String( society:(String( <<'hasBasicInfo'>>' <<'isContactof'>>' groupName:(String( Group' <<'isComposedof'>>' 1' *' 1' *'1' *' postID:(Integer( timeStamp:(Date( geoStamp:(String( Post' contentID:(Integer( text:(String( image:(Jpeg( video:(Avi( Content' <<'hasContent'>>' <<'publishesPost'>>' *' 1' 1' 1' street:(String,(( number:(Integer,( City:(Sting,(( Zipcode:(Integer( Address' 1' *' number:(String( type:({pers,(prof}( Phone' email:(String( type:({pers,(prof}( Email' <<'hasAddress'>>' <<'hasOtherPhones'>>' 1' *' 1' *' *' <<'hasMobilePhones'>>' 1' <<'hasEmail'>>' webSite:(URI( socialNetworkID:(URI(( ( Basic(Info( Contact' idContact:(Integer( lastName:(String( givenName:(String( society:(String( <<'hasBasicInfo'>>' <<'isContactof'>>' groupName:(String( Group' <<'isComposedof'>>' 1' *' 1' *'1' *' postID:(Integer( timeStamp:(Date( geoStamp:(String( Post' contentID:(Integer( text:(String( image:(Jpeg( video:(Avi( Content' <<'hasContent'>>' <<'publishesPost'>>' *' 1' 1' 1' street:(String,(( number:(Integer,( City:(Sting,(( Zipcode:(Integer( Address' 1' *' number:(String( type:({pers,(prof}( Phone' email:(String( type:({pers,(prof}( Email' <<'hasAddress'>>' <<'hasOtherPhones'>>' 1' *' 1' *' *' <<'hasMobilePhones'>>' 1' <<'hasEmail'>>' webSite:(URI( socialNetworkID:(URI(( ( Basic(Info( Contact' idContact:(Integer( lastName:(String( givenName:(String( society:(String( <<'hasBasicInfo'>>' <<'isContactof'>>' groupName:(String( Group' <<'isComposedof'>>' 1' *' 1' *'1' *' postID:(Integer( timeStamp:(Date( geoStamp:(String( Post' contentID:(Integer( text:(String( image:(Jpeg( video:(Avi( Content' <<'hasContent'>>' <<'publishesPost'>>' *' 1' 1' 1' street:(String,(( number:(Integer,( City:(Sting,(( Zipcode:(Integer( Address' 1' *' number:(String( type:({pers,(prof}( Phone' email:(String( type:({pers,(prof}( Email' <<'hasAddress'>>' <<'hasOtherPhones'>>' 1' *' 1' *' *' <<'hasMobilePhones'>>' 1' <<'hasEmail'>>'
  • 33. MANAGING A POLYGLOT DATABASE QUERYING, INSERTING, MAINTAINING 33
  • 34. CRUD OPERATIONS 34 Contact Group Content Post Id   FirstName   LastName   Society   Consistent view of data
  • 35. EXAMPLE 1: SYNCHRONIZING REDIS+MYSQL 35 https://oracleus.activeevents.com/connect/sessionDetail.ww?SESSION_ID=4775     Updating REDIS #FAIL begin  MySQL  transaction    update  MySQL    update  Redis   rollback  MySQL  transaction   begin  MySQL  transaction    update  MySQL   commit  MySQL  transaction   <<  system  crashes  >>   update  Redis   Redis has updated MySQL does not MySQL has updated Redis does not
  • 36. EXAMPLE 1: UPDATING REDIS RELIABLY Step I Step 2 36 begin  MySQL  transaction    update  MySQL    queue  CRUD  event  in  MySQL   commit  transaction   Event  Id   Operation:  Create,  Update,  Delete   queue  CRUD  event  in  MySQL   New  entity  state,  e.g.  JSON   ACID for  each  CRUD  event  in  MySQL  queue     get  next  CRUD  event  from  MySQL  queue   if  CRUD  event  is  not  duplicate  then    update  Redis  (incl.  eventID)   end  if     begin  MySQL  transaction    mark  CRUD  event  processed   commit  transaction     end  for  each  
  • 37. EXAMPLE 1: UPDATING REDIS RELIABLY 37 EntityCRUDEvent   Repository   EntityCRUDEvent   Processor   Redis  updater   ID   JSON   Processed?   INSERT  INTO  ..   SELECT  …  FROM..   apply(event)   Timer   Step 1 Step 2
  • 38. EXAMPLE 1:TRACKING CHANGES (HIBERNATE) 38
  • 39. POLYGLOT DATABASE EVOLUTION ¡  Problem statement: ¡  Evolution of the application: modification of classes, new classes, new relationships among classes ¡  Evolution of the “entities” managed in the polyglot database ¡  Some change structure, change values, … ¡  The content of the stores start deriving from the application data structures ¡  Which is the current structure of the entities stored? ¡  Are there elements that are not being accessed because they do not longer correspond to the application data structures? 39 Contact' idContact:"Integer" firstName:"String" lastName:"String" webSite:(URI( socialNetworkID:(URI(( ( Basic(Info( Contact' idContact:(Integer( lastName:(String( givenName:(String( society:(String( <<'hasBasicInfo'>>' 1' *' postID:(Integer( timeStamp:(Date( geoStamp:(String( Post' contentID:(Integer( text:(String( Content' <<'hasContent'>>' <<publishesPost>>' *' 1' 1' 1' street:(String,(( number:(Integer,( City:(Sting,(( Zipcode:(Integer( Address' *' email:(String( type:({pers,(prof}( Email' <<'hasAddress'>>' *' 1' <<'hasEmail'>>' ID:(Integer,( Lang:(String( Language' *' <<'speaksLanguage'>>'
  • 40. EXTRACTING SCHEMATA FROM NOSQL PROGRAMS CODE http://code.google.com/p/exschema/     40 Metalayer( Struct( Relationship( Attribute(Set( *( *( *( *( *( *( *( *(*( *( *( *( *( *( Declarations   analyzer   Updates   analyzer   Repositories   analyzer   Schema1 Schema2 Schema3 Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.UserInfo Struct Attribute name : userId Attribute name : firstName Attribute name : lastName Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.Tweet Struct Attribute name : text Attribute name : userId Set Attribute implementation : Neo4j Struct Attribute name : fr.imag.twitter.domain.User Str Attribute name : nodeId Attribute name : userName A na Relatio start end Schema1 Schema2 Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.UserInfo Struct Attribute name : userId Attribute name : firstName Attribute name : lastName Set Attribute implementation : Spring Repository Attribute name : fr.imag.twitter.domain.Tweet Attribute name : text Attri name : Schema1 Schema2 Schema3 Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.UserInfo Struct Attribute name : userId Attribute name : firstName Attribute name : lastName Set Attribute implementation : Spring Repository Set Attribute name : fr.imag.twitter.domain.Tweet Struct Attribute name : text Attribute name : userId Set Attribute implementation : Neo4j Struct Attribute name : fr.imag.twitter.domain.User Struct Attribute name : nodeId Attribute name : userName Attribute name : userId Attribute name : password Relationship start end Attribute name : followers Spring Data
  • 41. EXTRACTING SCHEMATA FROM A POLYGLOT DATABASE 41 Application  source  code   Metalayer( Struct( Relationship( Attribute(Set( *( *( *( *( *( *( *( *(*( *( *( *( *( *( Declarations   Analyzer   Updates   Analyzer   Annotations   Analyzer   Repositories   Analyzer   ExSchema   Variables/Methods  Graph  Store   Annotations/Variables     Relational  &     document  stores  
  • 42. AGENDA ü  Dealing with multiform and multimedia data collections: ü  it’s all about management requirements ü  Polyglot persistence general principle ü  The NoSQL plethora ü  Polyglot database ü  Design ü  Deployment ü  Querying, update and maintenance ¡  Conclusions and outlook 42
  • 43. CHALLENGE: POLYGLOT MEETS XPERANTO Given a data collection coming from different social networks stored on NoSQL systems (Neo4J and Mongo) [possibly according to a strategy combining sharding and replication techniques], extend the UnQL pivot query language considering ¡  Data processing operators adapted to query different data models (graph, document). Example query Neo + Mongo and what about Join, Union … ¡  Assuming concurrent CRUD operations to the stores can you expect query results to be consistent ? How can you tag your results or implement a sharding strategy in order to determine whether results are consistent? ¡  Querying data represented on different models: How can you exploit the structure of the different stores for expressing queries ? Provide adapted operators? Give generic operators and then rewrite queries? ¡  Normally, Polyglot solutions tend to solve some data processing issues in the application code.This can be penalizing. Discuss the challenges to address for ensuring that your queries will be able to scale as the collection grows. 43
  • 44. CHALLENGE: EXPECTED RESULTS ¡  Give the principle of your proposal through a partial programming solution, of the operators of your UnQL extension, detail the query evaluation process if U want your solution to scale ¡  We ask U to sketch the solution on the polyglot database that we provide consisting of mongo, Neo4J stores ¡  https://github.com/jccastrejon/edbt-unql ¡  Technical requirements:VMware player 5 44
  • 45. WHEN IS POLYGLOT PERSISTENCE PERTINENT? ¡  Application essentially composing and serving web pages ¡  They only looked up page elements by ID, they had different needs or availability, concurrency and no need to share all their data ¡  A problem like this is much better suited to a NoSQL store than the corporate relational DBMS ¡  Scaling to lots of traffic gets harder and harder to do with vertical scaling ¡  Many NoSQL databases are designed to operate over clusters ¡  They can tackle larger volumes of traffic and data than is realistic with a single server 45
  • 46. CONCLUSIONS ¡  Data are growing big and more heterogeneous and they need new adapted ways to be managed thus the NoSQL movement is gaining momentum ¡  Data heterogeneity implies different management requirements this is where polyglot persistence comes up ¡  Consistency – Availability – Fault tolerance theorem: find the balance ! ¡  Which data store according to its data model? ¡  A lot of programming implied … 46 Open opportunities if you’re interested in this topic!
  • 47. Contact: GenovevaVargas-Solar, CNRS, LIG-LAFMIA Genoveva.Vargas@imag.fr   http://www.vargas-­‐solar.com/teaching/         http://code.google.com/p/exschema/     47 http://code.google.com/p/model2roo/     Open source polyglot persistence tools
  • 48. Prof.  Dr.  Christine  Collet   Grenoble  INP   France   Juan  Carlos  Castrejón   University  of  Grenoble   France   Javier  Espinosa   University  of  Grenoble   France   Dr.  Genoveva  Vargas-­‐Solar   CNRS,  LIG-­‐LAFMIA   France  
  • 49. REFERENCES ¡  Eric  A.,  Brewer  "Towards  robust  distributed  systems."  PODC.  2000   ¡  Rick,  Cattell  "Scalable  SQL  and  NoSQL  data  stores."  ACM  SIGMOD  Record  39.4  (2011):  12-­‐27   ¡  Juan   Castrejon,   Genoveva   Vargas-­‐Solar,   Christine   Collet,   and   Rafael   Lozano,   ExSchema:   Discovering  and  Maintaining  Schemas  from  Polyglot  Persistence  Applications,  In  Proceedings  of   the  International  Conference  on  Software  Maintenance,  Demo  Paper,  IEEE,  2013     ¡  M.  Fowler  and  P.  Sadalage.  NoSQL  Distilled:  A  Brief  Guide  to  the  Emerging  World  of  Polyglot   Persistence.  Pearson  Education,  Limited,  2012   ¡  C.   Richardson,   Developing   polyglot   persistence   applications,   http://fr.slideshare.net/ chris.e.richardson/developing-­‐polyglotpersistenceapplications-­‐gluecon2013   49
  • 51. DATA MODELS ¡  Tuple ¡  Row in a relational table, where attributes are pre-defined in a schema, and the values are scalar ¡  Document ¡  Allows values to be nested documents or lists, as well as scalar values. ¡  Attributes are not defined in a global schema ¡  Extensible record ¡  Hybrid between tuple and document, where families of attributes are defined in a schema, but new attributes can be added on a per-record basis 51
  • 52. DATA STORES ¡  Key-value ¡  Systems that store values and an index to find them, based on a key ¡  Document ¡  Systems that store documents, providing index and simple query mechanisms ¡  Extensible record ¡  Systems that store extensible records that can be partitioned vertically and horizontally across nodes ¡  Graph ¡  Systems that store model data as graphs where nodes can represent content modelled as document or key-value structures and arcs represent a relation between the data modelled by the node ¡  Relational ¡  Systems that store, index and query tuples 52
  • 53. KEY-VALUE STORES ¡  “Simplest data stores” use a data model similar to the memcached distributed in-memory cache ¡  Single key-value index for all data ¡  Provide a persistence mechanism ¡  Replication, versioning, locking, transactions, sorting ¡  API: inserts, deletes, index lookups ¡  No secondary indices or keys 53 SYSTEM ADDRESS Redis code.google.com/p/redis   Scalaris code.google.com/p/scalaris   Tokyo tokyocabinet.sourceforge.net   Voldemort project-­‐voldemort.com   Riak riak.basho.com   Membrain schoonerinfotech.com/products   Membase membase.com  
  • 54. SELECT    name   FROM        group   WHERE      gid  IN  (  SELECT    gid                                    FROM        group_member                                    WHERE      uid  =  me()  )   54 SELECT    name,  pic,  profile_url   FROM        user   WHERE      uid  =  me()   SELECT    name,  pic   FROM        user   WHERE      online_presence  =  "active"                    AND                  uid  IN  (  SELECT    uid2                                    FROM        friend                                    WHERE      uid1  =  me()  )   SELECT    name   FROM        friendlist   WHERE      owner  =  me()   SELECT    message,  attachment   FROM        stream   WHERE      source_id  =  me()  AND  type  =  80   https://developers.facebook.com/docs/reference/fql/  
  • 55. 55 <805114856,                 >
  • 56. DOCUMENT STORES ¡  Support more complex data: pointerless objects, i.e., documents ¡  Secondary indexes, multiple types of documents (objects) per database, nested documents and lists, e.g. B-trees ¡  Automatic sharding (scale writes), no explicit locks, weaker concurrency (eventual for scaling reads) and atomicity properties ¡  API: select,  delete,  getAttributes,   putAttributes  on documents ¡  Queries can be distributed in parallel over multiple nodes using a map-reduce mechanism 56 SYSTEM ADDRESS SimpleDB amazon.com/simpledb   Couch DB couchdb.apache.org   Mongo DB mongodb.org   Terrastore code.google.com/terrastore  
  • 58. EXTENSIBLE RECORD STORES ¡  Basic data model is rows and columns ¡  Basic scalability model is splitting rows and columns over multiple nodes ¡  Rows split across nodes through sharding on the primary key ¡  Split by range rather than hash function ¡  Rows analogous to documents: variable number of attributes, attribute names must be unique ¡  Grouped into collections (tables) ¡  Queries on ranges of values do not go to every node ¡  Columns are distributed over multiple nodes using “column groups” ¡  Which columns are best stored together ¡  Column groups must be pre-defined with the extensible record stores 58 SYSTEM ADDRESS HBase hbase.apache.com   HyperTable hypertable.org   Cassandra incubator.apache.org/cassandra  
  • 59. SCALABLE RELATIONAL SYSTEMS ¡  SQL: rich declarative query language ¡  Databases reinforce referential integrity ¡  ACID semantics ¡  Well understood operations: ¡  Configuration, Care and feeding, Backups,Tuning, Failure and recovery, Performance characteristics ¡  Use small-scope operations ¡  Challenge: joins that do not scale with sharding ¡  Use small-scope transactions ¡  ACID transactions inefficient with communication and 2PC overhead ¡  Shared nothing architecture for scalability ¡  Avoid cross-node operations 59 SYSTEM ADDRESS MySQL C mysql.com/cluster   Volt DB voltdb.com   Clustrix clustrix.com   ScaleDB scaledb.com   Scale Base scalebase.com   Nimbus DB nimbusdb.com  
  • 61. REPLICATION MASTER - SLAVE ¡  Makes one node the authoritative copy/replica that handles writes while replica synchronize with the master and may handle reeds ¡  All replicas have the same weight ¡  Replicas can all accept writes ¡  The lose of one of them does not prevent access to the data store ¡  Helps with read scalability but does not help with write scalability ¡  Read resilience: should the master fail, slaves can still handle read requests ¡  Master failure eliminates the ability to handle writes until either the master is restored or a new master is appointed ¡  Biggest complication is consistency ¡  Possible write – write conflict ¡  Attempt to update the same record at the same time from to different places ¡  Master is a bottle-neck and a point of failure 61 Master' Slaves' all'updates' made'to'the'master' changes'propagate'' To'slaves' reads'can'be'done' from'master'or'slaves'
  • 62. MASTER-SLAVE REPLICATION MANAGEMENT ¡  Masters can be appointed ¡  Manually when configuring the nodes cluster ¡  Automatically: when configuring a nodes cluster one of them elected as master.The master can appoint a new master when the master fails reducing downtime ¡  Read resilience ¡  Read and write paths have to be managed separately to handle failure in the write path and still reads can occur ¡  Reads and writes are put in different database connections if the database library accepts it ¡  Replication comes inevitably with a dark side: inconsistency ¡  Different clients reading different slaves will see different values if changes have not been propagated to all slaves ¡  In the worst case a client cannot read a write it just made ¡  Even if master-slave is used for hot backups, if the master fails any updates on to the backup are lost 62
  • 63. REPLICATION: PEER-TO-PEER ¡  Allows writes to any node; the nodes coordinate to synchronize their copies ¡  The replicas have equal weight ¡  Deals with inconsistencies ¡  Replicas coordinate to avoid conflict ¡  Network traffic cost for coordinating writes ¡  Unnecessary to make all replicas agree to write, only the majority ¡  Survival to the loss of the minority of replicas nodes ¡  Policy to merge inconsistent writes ¡  Full performance on writing to any replica 63 Master' nodes'communicate' their'writes' all'nodes'read' and'write'all'data'
  • 64. SHARDING ¡  Ability to distribute both data and load of simple operations over many servers, with no RAM or disk shared among servers ¡  A way to horizontally scale writes ¡  Improve read performance ¡  Application/data store support 64 ¡  Puts different data on separate nodes ¡  Each user only talks to one servicer so she gets rapid responses ¡  The load should be balanced out nicely between servers ¡  Ensure that ¡  data that is accessed together is clumped together on the same node ¡  that clumps are arranged on the nodes to provide best data access Each%shard%reads%and% writes%its%own%data%
  • 65. SHARDING Database laws ¡  Small databases are fast ¡  Big databases are slow ¡  Keep databases small Principle ¡  Start with a big monolithic database ¡  Break into smaller databases ¡  Across many clusters ¡  Using a key value 65 Instead of having one million customers information on a single big machine …. 100 000 customers on smaller and different machines
  • 66. SHARDING CRITERIA ¡  Partitioning ¡  Relational: handled by the DBMS (homogeneous DBMS) ¡  NoSQL: based on ranging of the k-value ¡  Federation ¡  Relational ¡  Combine tables stored in different physical databases ¡  Easier with denormalized data ¡  NoSQL: ¡  Store together data that are accessed together ¡  Aggregates unit of distribution 66
  • 67. SHARDING Architecture ¡  Each application server (AS) is running DBS/client ¡  Each shard server is running ¡  a database server ¡  replication agents and query agents for supporting parallel query functionality Process ¡  Pick a dimension that helps sharding easily (customers, countries, addresses) ¡  Pick strategies that will last a long time as repartition/ re-sharding of data is operationally difficult ¡  This is done according to two different principles ¡  Partitioning: a partition is a structure that divides a space into tow parts ¡  Federation: a set of things that together compose a centralized unit but each individually maintains some aspect of autonomy 67 Customers data is partitioned by ID in shards using an algorithm d to determine which shard a customer ID belongs to
  • 68. REPLICATION:ASPECTS TO CONSIDER ¡  Conditioning ¡  Important elements to consider ¡  Data to duplicate ¡  Copies location ¡  Duplication model (master – slave / P2P) ¡  Consistency model (global – copies) 68 Fault   tolerance   Availability   Transparency   levels   Performance   à Find a compromise !
  • 69. PARTITIONING A PARTITION IS A STRUCTURE THAT DIVIDES A SPACE INTO TOW PARTS 69
  • 70. BACKGROUND: DISTRIBUTED RELATIONAL DATABASES ¡  External schemas (views) are often subsets of relations (contacts in Europe and America) ¡  Access defined on subsets of relations: 80% of the queries issued in a region have to do with contacts of that region ¡  Relations partition ¡  Better concurrency level ¡  Fragments accessed independently ¡  Implications ¡  Check integrity constraints ¡  Rebuild relations 70
  • 71. FRAGMENTATION ¡  Horizontal ¡  Groups of tuples of the same relation ¡  Budget < 300 000 or >= 150 000 ¡  Not disjoint are more difficult to manage ¡  Vertical ¡  Groups attributes of the same relation ¡  Separate budget from loc and pname of the relation project ¡  Hybrid 71
  • 72. FRAGMENTATION: RULES Vertical ¡  Clustering ¡  Grouping elementary fragments ¡  Budget and location information in two relations ¡  Splitting ¡  Decomposing a relation according to affinity relationships among attributes Horizontal ¡  Tuples of the same fragment must be statistically homogeneous ¡  If t1 and t2 are tuples of the same fragment then t1 and t2 have the same probability of being selected by a query ¡  Keep important conditions ¡  Complete ¡  Every tuple (attribute) belongs to a fragment (without information loss) ¡  If tuples where budget >= 150 000 are more likely to be selected then it is a good candidate ¡  Minimum ¡  If no application distinguishes between budget >= 150 000 and budget < 150 000 then these conditions are unnecessary 72
  • 73. SHARDING: HORIZONTAL PARTITIONING ¡  The entities of a database are split into two or more sets (by row) ¡  In relational: same schema several physical bases/ servers ¡  Partition contacts in Europe and America shards where they zip code indicates where the will be found ¡  Efficient if there exists some robust and implicit way to identify in which partition to find a particular entity ¡  Last resort shard ¡  Needs to find a sharding function: modulo, round robin, hash – partition, range - partition 73 Load%balancer% Cache%1% Cache%2% Cache%3% MySQL% Master% MySQL% Master% Web%1% Web%2% Web%3% Even%IDs% Odd%IDs% MySQL% Slave%1% MySQL% Slave%2% MySQL% Slave%n% MySQL% Slave%1% MySQL% Slave%2% MySQL% Slave%n%
  • 74. FEDERATION A FEDERATION IS A SET OF THINGS THAT TOGETHER COMPOSE A CENTRALIZED UNIT BUT EACH INDIVIDUALLY MAINTAINS SOME ASPECT OF AUTONOMY 74
  • 75. FEDERATION:VERTICAL SHARDING ¡  Principle ¡  Partition data according to their logical affiliation ¡  Put together data that are commonly accessed ¡  The search load for the large partitioned entity can be split across multiple servers (logical and physical) and not only according to multiple indexes in the same logical server ¡  Different schemas, systems, and physical bases/ servers ¡  Shards the components of a site and not only data 75 Load%balancer% Cache%1% Cache%2% Cache%3% MySQL% Master% MySQL% Master% Web%1% Web%2% Web%3% Site%database% Resume%database% MySQL% Slave%1% MySQL% Slave%2% MySQL% Slave%n% MySQL% Slave%1% Internal% user%
  • 76. NOSQL STORES: PERSISTENCY MANAGEMENT 76
  • 77. «MEMCACHED» ¡  «memcached» is a memory management protocol based on a cache: ¡  Uses the key-value notion ¡  Information is completly stored in RAM ¡  «memcached» protocol for: ¡  Creating, retrieving, updating, and deleting information from the database ¡  Applications with their own «memcached» manager (Google, Facebook, YouTube, FarmVille,Twitter,Wikipedia) 77
  • 78. STORAGE ON DISC (1) ¡  For efficiency reasons, information is stored using the RAM: ¡  Work information is in RAM in order to answer to low latency requests ¡  Yet, this is not always possible and desirable Ø  The process of moving data from RAM to disc is called "eviction”; this process is configured automatically for every bucket 78
  • 79. STORAGE ON DISC (2) ¡  NoSQL servers support the storage of key-value pairs on disc: ¡  Persistency–can be executed by loading data, closing and reinitializing it without having to load data from another source ¡  Hot backups– loaded data are sotred on disc so that it can be reinitialized in case of failures ¡  Storage on disc– the disc is used when the quantity of data is higher thant the physical size of the RAM, frequently used information is maintained in RAM and the rest es stored on disc 79
  • 80. STORAGE ON DISC (3) ¡  Strategies for ensuring: ¡  Each node maintains in RAM information on the key-value pairs it stores. Keys: ¡  may not be found, or ¡  they can be stored in memory or on disc ¡  The process of moving information from RAM to disc is asynchronous: ¡  The server can continue processing new requests ¡  A queue manages requests to disc Ø  In periods with a lot of writing requests, clients can be notified that the server is termporaly out of memory until information is evicted 80
  • 82. MULTIVERSION CONCURRENCY CONTROL (MVCC) ¡  Objective: Provide concurrent access to the database and in programming languages to implement transactional memory ¡  Problem: If someone is reading from a database at the same time as someone else is writing to it, the reader could see a half-written or inconsistent piece of data. ¡  Lock: readers wait until the writer is done ¡  MVCC: ¡  Each user connected to the database sees a snapshot of the database at a particular instant in time ¡  Any changes made by a writer will not be seen by other users until the changes have been completed (until the transaction has been committed ¡  When an MVCC database needs to update an item of data it marks the old data as obsolete and adds the newer version elsewhere à multiple versions stored, but only one is the latest ¡  Writes can be isolated by virtue of the old versions being maintained ¡  Requires (generally) the system to periodically sweep through and delete the old, obsolete data objects 82