An Extensible NoSql Enhancing Application System


The big data emergency:
the need to scale horizontally, with performance growing linearly.
RDBMSs support complex queries
(e.g. JOINs, GROUP BYs, ...) and guarantee properties
that are hard to provide in a cluster.




Atomicity, Consistency, Isolation, Durability:
=> master-slave architecture.
GROUP BYs, JOINs and transactions (2PC)
=> distributed locks, possible deadlocks.
A distributed system is characterized by the choice it makes between
availability and consistency when a network partition occurs, and between
latency and consistency otherwise.
RDBMS

Standard:
a schema of tables with a fixed structure: each row has the same
number and type of columns.

NoSQL

Far West:
each database has a different data model with its own peculiarities.
It can be a key-value store, a table store, a document store,
column- or row-oriented, a graph database, ...
It is basically a two-level map... with "some" extensions.

The easiest way to access a value is through a row id and a
column id:

clients['cesare88']['surname'] = "cugnasco"

Here clients is the column family, cesare88 the row key, surname the
column name and "cugnasco" the value.

But…
The data model is flexible: it can be used in different ways.

Because columns are stored sorted by column name, data can also be
encoded in the column names themselves:
sports['cesare88']['9.5:bike'] = "02/02/2013"
sports['cesare88']['8.3:running'] = "02/07/2013"
sports['cesare88']['0.3:swimming'] = "02/09/2012"

The behavior is similar to a hash map that contains sorted maps
(in Java terms, a HashMap<?, SortedMap<?,?>>, or a
SortedMap<?, SortedMap<?,?>> when the partitioner is order preserving).
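
The "map of sorted maps" behavior can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the ColumnFamily class and its methods are hypothetical names invented for illustration, not Cassandra's API, and column names are compared as plain strings.

```python
class ColumnFamily:
    """Toy two-level map: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column_name, value):
        self._rows.setdefault(row_key, {})[column_name] = value

    def get(self, row_key, column_name):
        return self._rows[row_key][column_name]

    def columns(self, row_key):
        # Columns are returned ordered by column name, like a SortedMap.
        return sorted(self._rows.get(row_key, {}).items())


sports = ColumnFamily()
sports.put("cesare88", "9.5:bike", "02/02/2013")
sports.put("cesare88", "8.3:running", "02/07/2013")
sports.put("cesare88", "0.3:swimming", "02/09/2012")

# Reading the row back yields the columns in column-name order.
for name, value in sports.columns("cesare88"):
    print(name, value)
```

Note that plain string ordering would place "10.0:..." before "8.3:...", which is one reason to choose the column-name encoding (or a typed comparator) with care.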


Sorting: where to store the values?


In the row keys:
this requires an order-preserving partitioner, and one would have to
rebalance the tokens every time the data distribution changes.



In the column names:
one can finely define how the values are sorted... but all the values of a row
are stored on a single node (plus its replicas).
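
The trade-off between the two partitioning styles can be illustrated with a toy token ring. This is a sketch under assumptions: the node names, the initial tokens and the helper functions are hypothetical, and the MD5-based token only stands in for a hash partitioner; it is not Cassandra's actual implementation.

```python
import bisect
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster


def ring(tokens):
    """Pair each node with its initial token, sorted by token."""
    return sorted(zip(tokens, NODES))


def owner(ring_, token):
    """First node whose initial token is >= token, wrapping around the ring."""
    tokens = [t for t, _ in ring_]
    index = bisect.bisect_left(tokens, token)
    return ring_[index % len(ring_)][1]


def hash_token(key):
    # Hash partitioning: the token is a hash of the key, so keys spread evenly
    # but their order is lost.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


def order_token(key):
    # Order-preserving partitioning: the key itself is the token, so ranges of
    # keys stay together but the tokens must follow the key distribution.
    return key


HASH_SPACE = 2 ** 128
hash_ring = ring([(i + 1) * HASH_SPACE // 4 - 1 for i in range(4)])
order_ring = ring(["g", "n", "t", "z"])

for key in ["abel.castro", "benny.tucker", "bradley.ball", "olga.soto"]:
    print(key, owner(hash_ring, hash_token(key)), owner(order_ring, order_token(key)))
```

With the order-preserving ring every key starting with "a" or "b" lands on the same node, which keeps range scans local but concentrates the load whenever the key distribution is skewed.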







Indexing: custom indexes or Cassandra's hash-based ones?
Composite types
Query design: is it better to repeat a query or to duplicate data?
Workload distribution: are the queries well distributed across the nodes?

How to find the right solution? ...Many
experiments.
Testing many data models and configurations is a time- and money-consuming
process. The goal of the platform is to reduce this cost by:

Supporting data insertion:
a sample of the data source can be loaded into many different data models
through a single XML configuration file.


Generating the query code and the workload:
the framework statically generates the code of the queries and offers
support for running them under different workload types: uniformly
distributed, Zipfian, sequential, ...



Collecting and analyzing data:
running tests in a distributed environment requires gathering information from
many different servers and application layers. The framework supports this by
recording all metrics and storing them in a single data store that is easy to query.
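
The three workload types named above can be sketched as key generators. This is a minimal illustration, not the framework's generated code: the function names and the exponent parameter s are assumptions made for the example.

```python
import itertools
import random


def uniform(keys, rng):
    """Every key is equally likely."""
    while True:
        yield rng.choice(keys)


def zipfian(keys, rng, s=1.0):
    """The key at rank r (1-based) is drawn with weight 1 / r**s."""
    weights = [1.0 / (rank ** s) for rank in range(1, len(keys) + 1)]
    while True:
        yield rng.choices(keys, weights=weights, k=1)[0]


def sequential(keys):
    """Keys are requested one after the other, in a loop."""
    return itertools.cycle(keys)


rng = random.Random(7)  # fixed seed so a test run can be repeated
sports = ["soccer", "running", "tennis", "judo", "archery"]
print(list(itertools.islice(sequential(sports), 7)))
print(list(itertools.islice(zipfian(sports, rng), 7)))
```

Feeding the same query with each generator in turn shows how a data model reacts when a few keys attract most of the requests.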
The reference model


Which meta-data model should be used to map the source data onto
the data model under test?

user            type   name         time        latitude   longitude
abel.castro     sport  Jumping      1370222632  42.257831  2.406851
benny.tucker    sport  Sailing      1348309276  42.295647  2.658962
ruth.garcia     sport  Artistic     1329156676  42.072377  2.809246
pat.bradley     sport  Tennis       1353530029  41.676496  2.149473
rodney.murphy   sport  Archery      1363225476  41.530270  3.044038
olga.soto       group  medium blue  -           -          -
olga.soto       sport  Boxing       1337615059  42.364499  2.121684
randall.foster  sport  Rowing       1372530161  42.303022  2.978198
bradley.ball    sport  Rugby union  1375637435  41.454311  2.299615
lucille.colon   sport  Handball     1379432546  41.776692  2.426618
olga.soto       group  medium blue  -           -          -
lyle.fletcher   sport  Water polo   1385366732  41.884708  2.113619
karla.evans     sport  Rowing       1373036481  41.720376  2.935336
roman.garza     group  medium blue  -           -          -
bradley.ball    sport  Water polo   1341530737  41.560617  2.605116
pablo.myers     sport  Jumping      1352117759  41.964225  2.377526
greg.elliott    sport  Volleyball?  1361744929  42.153057  2.493464
sandy.silva     sport  Judo         1365447441  41.456430  2.340890
devin.hardy     sport  Fencing      1343737817  41.910421  2.749560
ben.bowman      sport  Trampoline   1336532092  42.273305  3.000000

How should the data be reordered and inserted into Cassandra?

One organization is good for the query:
"Who was running yesterday?"
But for the query "Which sports were practiced here yesterday?"
a different organization works better.
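
The two organizations can be sketched as two ways of regrouping the same source rows. The layouts below are hypothetical, since the slide figures are not available: the row-key formats, the 0.1-degree grid cell used for "here" and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone

# (user, sport, unix time, latitude, longitude): rows from the reference model
RECORDS = [
    ("abel.castro", "Jumping", 1370222632, 42.257831, 2.406851),
    ("pablo.myers", "Jumping", 1352117759, 41.964225, 2.377526),
    ("randall.foster", "Rowing", 1372530161, 42.303022, 2.978198),
    ("karla.evans", "Rowing", 1373036481, 41.720376, 2.935336),
]


def day(ts):
    """UTC calendar day of a Unix timestamp, as YYYY-MM-DD."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()


def cell(lat, lon):
    """Assumed notion of 'here': a grid cell of 0.1 degrees."""
    return f"{lat:.1f},{lon:.1f}"


def by_sport_and_day(records):
    """Row key sport:day, column name user. Serves 'who was running yesterday?'."""
    rows = defaultdict(dict)
    for user, sport, ts, lat, lon in records:
        rows[f"{sport}:{day(ts)}"][user] = cell(lat, lon)
    return rows


def by_place_and_day(records):
    """Row key cell:day, column name sport:user. Serves 'which sports were practiced here?'."""
    rows = defaultdict(dict)
    for user, sport, ts, lat, lon in records:
        rows[f"{cell(lat, lon)}:{day(ts)}"][f"{sport}:{user}"] = ts
    return rows


print(dict(by_sport_and_day(RECORDS)))
print(dict(by_place_and_day(RECORDS)))
```

Each layout answers its own query with a single row read, while the other query would have to scan many rows; this is the duplication-versus-repeated-queries trade-off in miniature.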


A model may work poorly with some types of
workload:

If 90% of your clients practice only soccer, does your
model scale?
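
The effect of such a skewed workload can be estimated before touching a cluster. The sketch below uses an invented request mix (90% soccer) and two hypothetical row-key choices; it counts how many requests hit each row key, which bounds how evenly the load can spread across nodes.

```python
import random
from collections import Counter


def simulate(row_key_of, requests):
    """Count how many requests hit each row key."""
    return Counter(row_key_of(request) for request in requests)


rng = random.Random(42)
mix = ["soccer"] * 90 + ["tennis"] * 5 + ["judo"] * 5  # assumed 90/5/5 mix
requests = [(rng.choice(mix), f"user{rng.randrange(1000)}") for _ in range(10_000)]

by_sport = simulate(lambda request: request[0], requests)  # row key = sport
by_user = simulate(lambda request: request[1], requests)   # row key = user

hottest_sport_share = by_sport.most_common(1)[0][1] / len(requests)
hottest_user_share = by_user.most_common(1)[0][1] / len(requests)
print(f"sport as row key: hottest row gets {hottest_sport_share:.0%} of requests")
print(f"user as row key:  hottest row gets {hottest_user_share:.2%} of requests")
```

When the sport is the row key, nearly all requests land on the single row (and node) holding soccer, no matter how many nodes are added.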


The workloader lets you define the type of
distribution for each argument of the query.

Are some sports practiced
more than others?

Are some places more likely
for some sports?

Do people practice more
sports on Saturdays?

Who was running here yesterday?
Get user where
sport=? and
position=? and
day(time)=?


You have a cluster of N database servers plus C clients,
and you need information about the internal metrics of
the database, the Java virtual machine, the load of
the operating system, the network traffic...
...then you have to collect them all and compare them
over time.
And, last but not least, you do not know exactly
what you are looking for!
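
Comparing metrics "by time" amounts to merging several time-sorted streams and aligning them on common intervals. A minimal sketch follows, with made-up samples and hypothetical source and metric names; it does not reflect how Aeneas stores its data.

```python
import heapq

# Hypothetical samples: (unix time, source, metric name, value).
# Each stream is already sorted by time, as a log file would be.
database = [(100, "db-node-1", "read_latency_ms", 2.1),
            (160, "db-node-1", "read_latency_ms", 9.8)]
jvm = [(105, "db-node-1", "heap_used_mb", 512.0),
       (165, "db-node-1", "heap_used_mb", 1900.0)]
client = [(101, "client-1", "requests_per_s", 850.0),
          (161, "client-1", "requests_per_s", 310.0)]


def merge_by_time(*streams):
    """Merge time-sorted streams into one time-sorted list."""
    return list(heapq.merge(*streams, key=lambda sample: sample[0]))


def align(samples, width):
    """Group samples into time buckets of `width` seconds."""
    buckets = {}
    for ts, source, metric, value in samples:
        start = ts - ts % width
        buckets.setdefault(start, {})[(source, metric)] = value
    return buckets


timeline = merge_by_time(database, jvm, client)
for start, metrics in sorted(align(timeline, 60).items()):
    print(start, metrics)
```

Once the samples share a time axis, metrics from different layers (here latency, heap usage and client throughput) can be read side by side for the same interval.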


Aeneas provides a web-based workbench that lets
you compare hundreds of different
performance metrics for each test.
Aeneas: An Extensible NoSql Enhancing Application System


Editor's Notes

  1. Let us go in order, though. First of all, a quick introduction to non-relational databases, also called NoSQL. In recent years we have seen exponential growth in the amount of data and the number of concurrent users that databases must handle. As a consequence, keeping the database on a single server, however powerful, has become unthinkable: a database now runs on a cluster of several interconnected servers. The problem is that scaling horizontally runs into difficulties tied to the relational specification. In particular, to guarantee Atomicity, Consistency and Isolation, relational databases adopt a master-slave architecture, which does not let performance grow linearly as new machines are added and is also vulnerable to the failure of one of the masters. Moreover, many functions offered by relational databases, such as transactions and joins, have serious scalability problems and may require distributed locks, which can lead to deadlocks in the system.
  2. All these architectural difficulties led to the birth of several databases that do not implement the relational specification and instead try out different architectures and data models to overcome these limits. The difference that perhaps stands out most among these products, and the one we will focus on, is the data model. Relational databases share a single standard, which describes data according to how they relate to each other and abstracts away how they are physically stored; in NoSQL this does not happen. Each product has its own kind of data model, which directly represents the way the data is stored.
  3. Among all non-relational databases we chose one in particular for its wide use and community support: Apache Cassandra. Note that the platform was nevertheless built so that it can easily be applied to different products. The feature of Cassandra that interests us most is that it can be seen as a distributed hash table: the identifier of a row is used by a module, called the partitioner, to compute where to store the data. The partitioner compares the value of the key, or the hash of the key, with the initial tokens of the nodes to work out which node the row belongs to. As the figure shows, Cassandra therefore has a peer-to-peer structure: the client can query any node; that node works out from the key which node stores the data, asks it, receives the answer and forwards it to the client.
  4. Cassandra's data model is mainly a two-level map: a first key, called the row key (in the example, my student ID), defines on which server the row is stored. Each row is then made up of columns, which are nothing but key-value pairs (column name and column value) and which are stored ordered by column name. In this case the column name is name and the value is cesare.
  5. This way of using columns and column families is similar to the relational model, but it is possible to go beyond it. For example, exploiting the fact that columns are stored sorted by column name, one can obtain sorted results: in the example, all the grades of a student sorted by grade. In terms of the data structures used in Java, the behavior of this data model is similar to a hash map whose values are in turn sorted maps. Note that when the partitioner is order preserving the row keys can be sorted too, so the model is closer to a sorted map that in turn contains sorted maps.
  6. A consequence of this flexibility is that many aspects must be considered when designing a database. For example, if sorted data is required, one must decide whether to sort it in the row keys or in the column families, with different trade-offs; if secondary indexes are needed, one must decide whether to build custom ones in an additional column family or to use Cassandra's hash indexes. All this ties into query design: is it better to duplicate data or to repeat queries? How does the data model behave under a given kind of load, and how does it react to the workload? With an 80/20 load, where some items are requested much more often than others, how is the work distributed? These are only some of the variables at play: which choices to make depends on each individual case, and also on the available hardware. Consequently, finding a correct configuration requires trying and comparing many different configurations, a process that can be costly in both money and time.
  7. Hence the idea of creating a platform that helps a developer through these phases. For each test we would have to write the code to insert the data and configure the cluster according to the configurations to try, generate the code to query the data, and analyze the results. All three phases are assisted by the platform.
  8. The framework is therefore divided into three parts. The loader reads data from a sample data source and, based on XML configuration files that describe how to insert the data into Cassandra, reorganizes the data and inserts it into the database. The workloader first generates the code to query Cassandra, based on the data model under test and a set of test queries; it also lets you set different types of load based on different statistical models. Finally, the analysis reporter collects information from every node of the cluster and from both the workloader and the loader and stores it in a single data store, organized by date and test name, which makes it easy to compare the results of the different models.