Aeneas:: An Extensible NoSql Enhancing Application System




The big data emergency:
the need to scale horizontally, ideally linearly.
RDBMSs support complex queries
(e.g. JOINs, GROUP BYs) and guarantee properties
that are hard to provide in a cluster.




Atomicity, Consistency, Isolation, Durability:
=> master-slave architecture.
GROUP BYs, JOINs and transactions (2PC)
=> distributed locks, possible deadlocks

A distributed system is characterized by the trade-off it makes: between
availability and consistency when a network partition occurs, and between
latency and consistency otherwise.

RDBMS

Standard:
A schema of tables with a fixed structure: every row has the same
number and type of columns.

NOSQL

Far West:
Each database has its own data model with its own peculiarities.
It can be a key-value store, a table store, a document store,
a column- or row-oriented store, a graph database…

The data model is basically a two-level map, with "some" extensions.


The easiest way to access a value is through a row id and a
column id:

clients['cesare88']['surname'] = "cugnasco"

Column family: clients
Row key: cesare88
Column name: surname
Value: cugnasco

But…

The data model is flexible: it can be used in different ways.


A value is still accessed through a row id and a column id,
but the column name can itself carry data:
sports['cesare88']['9.5:bike'] = "02/02/2013"
sports['cesare88']['8.3:running'] = "02/07/2013"
sports['cesare88']['0.3:swimming'] = "02/09/2012"

The behavior is similar to a hash map that contains sorted maps
(like a Java HashMap<?, SortedMap<?,?>>, or a
SortedMap<?, SortedMap<?,?>>).
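
The two-level structure can be sketched in a few lines of Python. This is only a toy model of the idea, not how Cassandra stores data: the class name ColumnFamily and its methods are made up for illustration, and column names are compared as plain strings.

```python
import bisect


class ColumnFamily:
    """Toy two-level map: row key -> columns kept sorted by column name."""

    def __init__(self):
        # row key -> (sorted list of column names, dict column name -> value)
        self._rows = {}

    def put(self, row_key, column, value):
        names, values = self._rows.setdefault(row_key, ([], {}))
        if column not in values:
            bisect.insort(names, column)  # keep the column names ordered
        values[column] = value

    def get(self, row_key, column):
        return self._rows[row_key][1][column]

    def columns(self, row_key):
        """Return the (column name, value) pairs of one row, in column order."""
        names, values = self._rows[row_key]
        return [(name, values[name]) for name in names]


sports = ColumnFamily()
sports.put("cesare88", "9.5:bike", "02/02/2013")
sports.put("cesare88", "8.3:running", "02/07/2013")
sports.put("cesare88", "0.3:swimming", "02/09/2012")

for name, value in sports.columns("cesare88"):
    print(name, value)
```

Because the comparison here is on strings, a column named "10.0:..." would sort before "8.3:...". A real model would need a numeric or composite comparator for such names.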



Sorting: where to store the values?


In the row keys:
this requires an order-preserving partitioner, and one would have to
rebalance the tokens every time the data distribution changes.



In the column names:
one can finely define how the values are sorted… but all the values of a row are
stored together, and thus on a single node (plus replicas).
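
The rebalancing problem with sorted row keys can be seen in a small simulation. The sketch below is a deliberate simplification (four nodes, key ranges cut by first byte, SHA-256 as the hash). The function names are hypothetical and none of this is Cassandra's actual token logic.

```python
import hashlib
from collections import Counter

NODES = 4


def hash_partition(key):
    """Hash placement: the node is derived from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES


def order_preserving_partition(key):
    """Range placement: the key space is cut into NODES contiguous ranges
    by first byte, so keys that sort together are stored together."""
    return key.encode("utf-8")[0] * NODES // 256


# Hypothetical skewed key set: every key shares the same prefix.
keys = ["user%04d" % i for i in range(1000)]

hash_load = Counter(hash_partition(k) for k in keys)
range_load = Counter(order_preserving_partition(k) for k in keys)

print("hash partitioner:", sorted(hash_load.items()))
print("order-preserving:", sorted(range_load.items()))
```

With the range placement every key lands on the same node, which is why the ranges would have to be moved whenever the key distribution changes. The hash placement spreads the load but loses the key order.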







Indexing: custom indexes or Cassandra's hash-based ones?
Composite types
Query design: is it better to repeat a query or to duplicate data?
Workload distribution: are the queries well distributed across the nodes?

How to find the right solution? …Many
experiments.

Testing many data models and configurations is a time- and money-consuming
process. The goal of the platform is to reduce this cost
by:

Supporting data insertion:
a sample of a data source can be loaded into many different data models using a
single XML configuration file.


Generating query code and workloads:
the framework statically generates the code of the queries and offers
support for running them under different workload types: uniformly
distributed, Zipfian, sequential…



Collecting and analyzing data:
running tests in a distributed environment requires gathering information from
many different servers and application layers. The framework supports this by
recording all metrics and storing them in a single data store that is easy to query.
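
The three workload types named above can be illustrated with plain Python generators. This is a sketch of the distributions only. The function names and the exponent s are assumptions, and it says nothing about how Aeneas implements its workloads.

```python
import itertools
import random


def uniform(items, rng):
    """Every item is equally likely."""
    while True:
        yield rng.choice(items)


def zipfian(items, rng, s=1.0):
    """The item at rank r (1-based) is drawn with weight 1 / r**s."""
    weights = [1.0 / (rank ** s) for rank in range(1, len(items) + 1)]
    while True:
        yield rng.choices(items, weights=weights, k=1)[0]


def sequential(items):
    """Items are visited in order, wrapping around at the end."""
    return itertools.cycle(items)


rng = random.Random(42)  # fixed seed so that runs are repeatable
users = ["abel.castro", "benny.tucker", "ruth.garcia", "pat.bradley"]

print(list(itertools.islice(sequential(users), 6)))
print(list(itertools.islice(uniform(users, rng), 6)))
print(list(itertools.islice(zipfian(users, rng), 6)))
```

Under the Zipfian workload the first user is requested far more often than the last, which is the kind of skew that exposes hot rows or hot nodes.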

The reference model


Which meta-data-model should be used to map the data into
the data model under test?

user            type   value        time        position
abel.castro     sport  Jumping      1370222632  42.257831  2.406851
benny.tucker    sport  Sailing      1348309276  42.295647  2.658962
ruth.garcia     sport  Artistic     1329156676  42.072377  2.809246
pat.bradley     sport  Tennis       1353530029  41.676496  2.149473
rodney.murphy   sport  Archery      1363225476  41.530270  3.044038
olga.soto       group  medium blue
olga.soto       sport  Boxing       1337615059  42.364499  2.121684
randall.foster  sport  Rowing       1372530161  42.303022  2.978198
bradley.ball    sport  Rugby union  1375637435  41.454311  2.299615
lucille.colon   sport  Handball     1379432546  41.776692  2.426618
olga.soto       group  medium blue
lyle.fletcher   sport  Water polo   1385366732  41.884708  2.113619
karla.evans     sport  Rowing       1373036481  41.720376  2.935336
roman.garza     group  medium blue
bradley.ball    sport  Water polo   1341530737  41.560617  2.605116
pablo.myers     sport  Jumping      1352117759  41.964225  2.377526
greg.elliott    sport  Volleyball?  1361744929  42.153057  2.493464
sandy.silva     sport  Judo         1365447441  41.456430  2.340890
devin.hardy     sport  Fencing      1343737817  41.910421  2.749560
ben.bowman      sport  Trampoline   1336532092  42.273305  3.000000

How to reorder and insert the data into Cassandra?

Good for the query:
"Who was running yesterday?"

But for the query "Which sports were practiced here yesterday?"
another organization is better.
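
Why each query wants its own organization can be sketched with two in-memory layouts built from a few records of the sample data. Both layouts are assumptions made for illustration (the layouts from the original slides are not preserved here), and the 0.1-degree grid cell used for "here" is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

# A few (user, sport, unix time, latitude, longitude) records from the sample.
records = [
    ("abel.castro", "Jumping", 1370222632, 42.257831, 2.406851),
    ("randall.foster", "Rowing", 1372530161, 42.303022, 2.978198),
    ("karla.evans", "Rowing", 1373036481, 41.720376, 2.935336),
    ("pablo.myers", "Jumping", 1352117759, 41.964225, 2.377526),
]


def day(ts):
    """UTC calendar day of a Unix timestamp."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")


def cell(lat, lon):
    """Hypothetical notion of 'here': coordinates rounded to 0.1 degree."""
    return "%.1f:%.1f" % (lat, lon)


# Layout A: row key = (sport, day), column name = user.
by_sport_day = defaultdict(dict)
# Layout B: row key = (cell, day), column name = sport.
by_place_day = defaultdict(dict)

for user, sport, ts, lat, lon in records:
    by_sport_day[(sport, day(ts))][user] = (lat, lon)
    by_place_day[(cell(lat, lon), day(ts))][sport] = user

query_day = day(1373036481)
# "Who was rowing that day?" is a single-row read in layout A.
print(sorted(by_sport_day[("Rowing", query_day)]))
# "Which sports were practiced here that day?" is a single-row read in layout B.
print(sorted(by_place_day[(cell(41.720376, 2.935336), query_day)]))
```

Answering the second question from layout A would mean scanning one row per sport, which is why the same data ends up duplicated under a second key.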



A model may work poorly with some types of
workloads:

If 90% of your clients only practice soccer, does your
model scale?



The workloader allows you to define the type of
distribution for each argument of the query

Are some sports practiced
more than others?

Are some places more
likely for some sports?

Do people practice more
sports on Saturdays?

Who was running here yesterday?
Get user where
sport=? and
position=? and
day(time)=?
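
Giving each argument of this query its own distribution could look like the following sketch. The candidate values, the weights and the generate function are all hypothetical. Only the query shape comes from the slide, and the real workloader's interface is not shown here.

```python
import random

# Hypothetical candidate values and weights for each query argument.
ARGUMENTS = {
    "sport": (["Soccer", "Running", "Rowing"], [90, 7, 3]),  # heavily skewed
    "position": (["41.5:2.1", "41.9:2.4", "42.3:2.9"], [1, 1, 1]),  # uniform
    "day": (["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
            [1, 1, 1, 1, 1, 3, 2]),  # weekends more likely
}

QUERY = "Get user where sport={sport} and position={position} and day(time)={day}"


def generate(n, seed=0):
    """Yield n query strings, drawing every argument from its own distribution."""
    rng = random.Random(seed)
    for _ in range(n):
        bound = {
            name: rng.choices(values, weights=weights, k=1)[0]
            for name, (values, weights) in ARGUMENTS.items()
        }
        yield QUERY.format(**bound)


for query in generate(3):
    print(query)
```

Changing only the weights of one argument makes it possible to replay the same query against the same data model under a different skew.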



You have a cluster of N database servers plus C clients
and you need information about the internal metrics of
the database, the Java virtual machine, the load of
the operating system, the network traffic…
…then you have to collect all of them and compare them over
time.
And, last but not least, you do not know exactly
what you are looking for!
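
Comparing metrics from different machines by time mostly comes down to putting them on a common time axis. A minimal sketch, assuming samples arrive as (time, source, metric, value) tuples and that a fixed 10-second bucket is acceptable. The sources, metric names and values are invented.

```python
from collections import defaultdict

# Hypothetical samples: (unix time, source, metric name, value).
samples = [
    (1370222632, "db-node-1", "read_latency_ms", 4.2),
    (1370222633, "db-node-1", "jvm_heap_mb", 512.0),
    (1370222634, "client-1", "requests_per_s", 930.0),
    (1370222641, "db-node-1", "read_latency_ms", 9.8),
    (1370222643, "client-1", "requests_per_s", 610.0),
]

BUCKET_SECONDS = 10  # samples in the same 10 s window are compared together


def align_by_time(samples, bucket_seconds=BUCKET_SECONDS):
    """Group samples from all sources into fixed time buckets."""
    table = defaultdict(dict)
    for ts, source, metric, value in samples:
        bucket = ts - ts % bucket_seconds
        table[bucket][(source, metric)] = value  # last sample in a bucket wins
    return dict(sorted(table.items()))


for bucket, row in align_by_time(samples).items():
    print(bucket, row)
```

Each output row then holds every metric observed in the same window, whatever server or layer it came from, so unexpected correlations can be looked for after the test has run.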



Aeneas provides a web-based workbench that lets
you compare hundreds of different performance metrics
for each test.
