Bucket your partitions wisely
Markus Höfer
IT Consultant

Recap c* partitions
• The partition determines which c* node the data resides on
• A partition is identified by its partition key
• Nodes "own" token ranges, which map directly to partitions
• Tokens are calculated by hashing the partition key
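
To make the recap concrete, here is a minimal sketch of how a partition key could be hashed to a token and mapped to the node owning that token range. It is an illustration only: Cassandra's default partitioner uses Murmur3, whereas this sketch substitutes MD5 from the standard library, and the `TokenRing` class, node names, and token values are hypothetical.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Hash a partition key to a 64-bit token (MD5 as a stand-in for Murmur3)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Hypothetical ring: each node owns the token range ending at its own token."""

    def __init__(self, node_tokens: dict[str, int]):
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, partition_key: str) -> str:
        # First node whose token is >= the key's token; wrap around at the end.
        idx = bisect.bisect_left(self._tokens, token_for(partition_key))
        return self._ring[idx % len(self._ring)][1]


ring = TokenRing({"node-a": 2**62, "node-b": 2**63, "node-c": 2**64 - 1})
owner = ring.node_for("n1")  # every row of partition "n1" lands on this node
```

Because the token depends only on the partition key, all rows of one partition end up on the same node (and its replicas), which is why partition size matters.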

• DataStax recommendations for partitions:
  • Maximum number of rows: hundreds of thousands
  • Disk size: hundreds of MB

What's the problem with big partitions?
• Every request for such a partition hits the same nodes. -> Not scalable!
• Frequent deletes will slow down your reads or even lead to a TombstoneOverwhelmingException

Use case "notebook"

Use case – Environment
[Figure: environment diagram with three elements labelled "Keyspace"]

Use case – Load and requirements
• Many concurrent processes
• Scalability is important
• Load peaks will happen!

Use case
• A user (owner) can create a notebook
• An owner can create notes belonging to a notebook
• Users can fetch notes (ideally only once), not necessarily in a particular order
• Users can delete notes

Note_by_notebook (P = partition key, C = clustering column)
P  Notebook   [text]
C  Title      [text]
   Comment    [text]
   Owner      [text]
   CreatedOn  [timestamp]

Use case
First things first:
Dev: "How many notes per notebook?"
PO: "I assume a maximum of 100,000 notes"
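
Use case – Let's do the math
How many values do we store with 100,000 rows per notebook?

$$N_v = N_r \times (N_c - N_{pk} - N_s) + N_s$$

where $N_v$ is the number of values, $N_r$ the number of rows, $N_c$ the number of columns, $N_{pk}$ the number of primary key columns, and $N_s$ the number of static columns.

Note_by_notebook
P  Notebook   [text]
C  Title      [text]
   Comment    [text]
   Owner      [text]
   CreatedOn  [timestamp]

  num_rows
* num_regular_columns
+ num_static_columns
= values_per_notebook

  100,000
* 3
+ 0
= 300,000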
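
The value count above can be checked with a few lines of Python. This is a small sketch of the formula only; the function and parameter names are illustrative and not taken from the talk.

```python
def values_per_partition(num_rows: int, num_columns: int,
                         num_primary_key_columns: int,
                         num_static_columns: int) -> int:
    """N_v = N_r * (N_c - N_pk - N_s) + N_s"""
    regular_columns = num_columns - num_primary_key_columns - num_static_columns
    return num_rows * regular_columns + num_static_columns


# Note_by_notebook: 5 columns, 2 of them in the primary key
# (Notebook, Title), no static columns.
values = values_per_partition(100_000, 5, 2, 0)
```

With 100,000 rows this yields the 300,000 values used in the later size estimate.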

Use case – Size assumptions
Note_by_notebook
P  Notebook   [text]       16 bytes
C  Title      [text]       60 bytes
   Comment    [text]       200 bytes
   Owner      [text]       16 bytes
   CreatedOn  [timestamp]  8 bytes

Use case – Let's do the math
Ok, so how much data is that on disk?

$$\sum_i sizeOf(c_{k_i}) + \sum_j sizeOf(c_{s_j}) + N_r \times \Big( \sum_k sizeOf(c_{r_k}) + \sum_l sizeOf(c_{c_l}) \Big) + 8 \times N_v$$

where $c_k$ are the partition key columns, $c_s$ the static columns, $c_r$ the regular columns, and $c_c$ the clustering columns.

  sizeof(P)
+ sizeof(S)
+ num_rows * (sizeof(C) + sizeof(regular_columns))
+ 8 * num_values
= bytes_per_partition

  16 bytes
+ 0 bytes
+ 100,000 * (60 bytes + 224 bytes)
+ 8 bytes * 300,000
= 30,800,016 bytes
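
The size estimate can be reproduced in code. The sketch below simply mirrors the slide's simplified formula with the stated size assumptions; the function name and parameter names are illustrative.

```python
def partition_size_bytes(num_rows: int, partition_key_bytes: int,
                         static_bytes: int, clustering_bytes: int,
                         regular_bytes: int, num_values: int,
                         per_value_overhead: int = 8) -> int:
    """sizeof(P) + sizeof(S) + num_rows * (sizeof(C) + sizeof(regular)) + 8 * num_values"""
    return (partition_key_bytes
            + static_bytes
            + num_rows * (clustering_bytes + regular_bytes)
            + per_value_overhead * num_values)


# Size assumptions from the slides: Notebook 16, Title 60,
# Comment 200 + Owner 16 + CreatedOn 8 = 224 bytes of regular columns.
size = partition_size_bytes(100_000, 16, 0, 60, 200 + 16 + 8, 300_000)
```

Swapping in 300,000 rows (900,000 values) or 20 million rows (60 million values) reproduces the other two figures quoted in the talk.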

Use case
Dev: "31 MB for 100,000 rows in a partition"
PO: "Sorry 'bout that, but it's going to be 300,000 rows. Is that a problem?"

Use case – Let's do the math
How much data do we store with 300,000 rows per notebook?
92,400,016 bytes

Use case
Dev: "That might be ok if we don't delete too much, it'll be around 93 MB for 300,000 rows in a partition"
PO: "Small mistake on my side... It actually could happen that someone inserts 20 million notes."

Use case – Let's do the math
Ok, just for fun: How much data is that on disk?

$$\sum_i sizeOf(c_{k_i}) + \sum_j sizeOf(c_{s_j}) + N_r \times \Big( \sum_k sizeOf(c_{r_k}) + \sum_l sizeOf(c_{c_l}) \Big) + 8 \times N_v$$

  sum(sizeof(P))
+ sum(sizeof(S))
+ num_rows * (sum(sizeof(C)) + sum(sizeof(regular_column)))
+ 8 * num_values
= bytes_per_partition

  16 bytes
+ 0 bytes
+ 20,000,000 * (60 bytes + 224 bytes)
+ 8 bytes * 60,000,000
= 6,160,000,016 bytes
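
Turning the estimate around shows how many rows fit into a partition of a given size, and therefore how many buckets 20 million notes would need. This is a back-of-the-envelope sketch: the 100 MB budget is an assumption chosen for illustration, and the constants reuse the size assumptions from the slides.

```python
import math

PARTITION_KEY_BYTES = 16
# Per row: clustering column (60) + regular columns (224) + 8 bytes for each of 3 values.
BYTES_PER_ROW = 60 + 224 + 8 * 3


def max_rows_for(target_bytes: int) -> int:
    """Largest row count whose estimated partition size stays within target_bytes."""
    return (target_bytes - PARTITION_KEY_BYTES) // BYTES_PER_ROW


def buckets_needed(total_rows: int, target_bytes: int) -> int:
    """Number of buckets required so that no bucket exceeds target_bytes."""
    return math.ceil(total_rows / max_rows_for(target_bytes))


TARGET_BYTES = 100_000_000  # assumed budget of 100 MB per partition
rows_per_bucket = max_rows_for(TARGET_BYTES)
buckets = buckets_needed(20_000_000, TARGET_BYTES)
```

Under these assumptions a single notebook with 20 million notes would have to be split into dozens of buckets, which motivates the strategies that follow.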

Bucketing strategies

Bucketing strategies – Incrementing bucket id
Increment a bucket "counter" based on the row count inside the partition
+ Good if the client is able to track the count
- Not very scalable
- Possibly unreliable counter

Flow: insertNote -> bucketFull? If no, insert into the current bucket; if yes, Bucket++ first.

notebook  Bucket
n1        0
n1        1

Note_by_notebook
P  Notebook  [text]
P  bucket    [int]
C  Title     [text]
   ...
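
The insertNote / bucketFull? / Bucket++ flow could look roughly like this on the client side. It is a minimal in-memory sketch, assuming the client tracks the row count itself; the class name `IncrementingBucketer` and the `max_rows_per_bucket` parameter are hypothetical.

```python
from collections import defaultdict


class IncrementingBucketer:
    """Tracks, per notebook, the current bucket id and how many rows it holds."""

    def __init__(self, max_rows_per_bucket: int):
        self.max_rows = max_rows_per_bucket
        self.current_bucket = defaultdict(int)  # notebook -> bucket id
        self.rows_in_bucket = defaultdict(int)  # notebook -> rows in current bucket

    def bucket_for_insert(self, notebook: str) -> int:
        if self.rows_in_bucket[notebook] >= self.max_rows:  # bucketFull?
            self.current_bucket[notebook] += 1              # Bucket++
            self.rows_in_bucket[notebook] = 0
        self.rows_in_bucket[notebook] += 1
        return self.current_bucket[notebook]


bucketer = IncrementingBucketer(max_rows_per_bucket=2)
assigned = [bucketer.bucket_for_insert("n1") for _ in range(5)]
```

The counter lives in one client here, which hints at the drawbacks listed above: with many concurrent writers the count has to be shared, and a shared counter is both a bottleneck and a possible source of inaccuracy.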

Bucketing strategies – Unique bucketing
Identify buckets using UUIDs
+ Good if clients are able to track the count
+ Scales better
- Possibly unreliable counter
- Lookup table(s) needed

Flow: insertNote -> bucketFull? If no, insert into the current bucket; if yes, create a new bucket (uuid2) first.

notebook  Bucket
n1        uuid1
n1        uuid2

Note_by_notebook
P  Notebook  [text]
P  bucket    [uuid]
C  Title     [text]
   ...
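
A sketch of unique bucketing follows, again in memory and under assumptions: the `buckets_by_notebook` dictionary stands in for the lookup table the slide mentions, and the class and parameter names are hypothetical.

```python
import uuid
from collections import defaultdict


class UniqueBucketer:
    """Assigns UUID bucket ids and remembers them in a lookup structure."""

    def __init__(self, max_rows_per_bucket: int):
        self.max_rows = max_rows_per_bucket
        self.buckets_by_notebook = defaultdict(list)  # stand-in for a lookup table
        self.rows_in_current = defaultdict(int)

    def bucket_for_insert(self, notebook: str) -> uuid.UUID:
        buckets = self.buckets_by_notebook[notebook]
        if not buckets or self.rows_in_current[notebook] >= self.max_rows:
            buckets.append(uuid.uuid4())  # new bucket, no coordination needed
            self.rows_in_current[notebook] = 0
        self.rows_in_current[notebook] += 1
        return buckets[-1]


bucketer = UniqueBucketer(max_rows_per_bucket=2)
ids = [bucketer.bucket_for_insert("n1") for _ in range(3)]
```

Since bucket ids cannot be derived from anything, readers need the lookup structure to discover which buckets exist for a notebook.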

Bucketing strategies – Time based bucketing
Split partitions into discrete time frames, e.g. a new bucket every 10 minutes
+ Amount of buckets per day is defined
+ Fast solution on insert
- Not very scalable

Time         notebook  Bucket
0:00 – 0:10  n1        0
0:10 – 0:20  n1        1
0:20 – 0:30  n1        2

Note_by_notebook
P  Notebook  [text]
P  bucket    [int]
C  Title     [text]
   ...
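
The time-to-bucket mapping in the table can be computed directly from the timestamp. This is a small sketch assuming the bucket id is the index of the 10-minute window within the day; the function name is illustrative.

```python
from datetime import datetime

BUCKET_MINUTES = 10


def time_bucket(ts: datetime, bucket_minutes: int = BUCKET_MINUTES) -> int:
    """Index of the time window within the day that contains ts."""
    return (ts.hour * 60 + ts.minute) // bucket_minutes


buckets_per_day = 24 * 60 // BUCKET_MINUTES
example = time_bucket(datetime(2016, 1, 1, 0, 25))  # falls into 0:20 – 0:30
```

In this sketch the ids repeat every day, so a real model would presumably also need the date in the key; the slides do not spell this out.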

Bucketing strategies – Hash based bucketing
Calculate the bucket from a hash of the primary key
+ Amount of buckets is defined
+ Deterministic
+ Fast solution
- Not possible if the amount of rows is unknown

Example: amount of buckets = 2000
hash 9523 % 2000 -> bucket 1523
hash 7723 % 2000 -> bucket 1723

notebook  Bucket
n1        1523
n1        1723

Note_by_notebook
P  Notebook  [text]
P  bucket    [int]
C  Title     [text]
   ...
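
Hash based bucketing reduces to a hash followed by a modulo. The sketch below uses CRC32 from the standard library as an assumed, deterministic hash (the slides do not name one), applied to the notebook name and note title; the function name is hypothetical.

```python
import zlib

NUM_BUCKETS = 2000


def hash_bucket(notebook: str, title: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Deterministic bucket id derived from the primary key columns."""
    key = f"{notebook}\x00{title}".encode("utf-8")
    return zlib.crc32(key) % num_buckets


bucket = hash_bucket("n1", "Shopping list")
```

Python's built-in `hash()` is avoided on purpose because it is randomized per process for strings, which would break the determinism this strategy relies on. The bucket count is fixed up front, which is why the approach needs a known upper bound on the number of rows.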

Bucketing strategies – Comparison
Strategies: Incrementing | Time based | Unique | Hash based
Unknown amount of notes: - + -
Scalable: - - + -
No lookup tables needed: - - +
Fast for writing: + + +
Amount of buckets known: - + - +

Datamodel "notebook"

Datamodel – Unique bucketing

note_by_notebook
P  Notebook   [text]
P  Bucket     [timeuuid]
C  Title      [text]
   Comment    [text]
   Creator    [text]
   CreatedOn  [timestamp]

notebook_partitions_by_name
P  Notebook  [text]
C  Bucket    [timeuuid]

notebook_partitions_by_note
P  Notebook    [text]
P  Note_title  [text]
   Bucket      [timeuuid]

Problems:
● How to make sure partitions don't grow too big?
● How to make sure notes are not picked twice?

Datamodel
How to make sure partitions don't grow too big?
● Client-side caching for writing
● A client instance "owns" a partition for a limited time
● It creates a new partition after this time
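
The write-side idea, a client instance owning a partition for a limited time and then switching to a fresh one, might be sketched like this. Everything here is an assumption for illustration: the class name `WriteBucketOwner`, the rotation interval, and the injectable clock; `uuid.uuid1()` is used because it is time based, like a CQL timeuuid.

```python
import time
import uuid


class WriteBucketOwner:
    """Caches the current write bucket per notebook and rotates it after a fixed time."""

    def __init__(self, rotate_after_seconds: float, clock=time.monotonic):
        self.rotate_after = rotate_after_seconds
        self.clock = clock
        self._current = {}  # notebook -> (bucket id, creation time)

    def bucket_for_insert(self, notebook: str) -> uuid.UUID:
        now = self.clock()
        entry = self._current.get(notebook)
        if entry is None or now - entry[1] >= self.rotate_after:
            entry = (uuid.uuid1(), now)  # start a new partition
            self._current[notebook] = entry
        return entry[0]


fake_now = [0.0]
owner = WriteBucketOwner(rotate_after_seconds=60, clock=lambda: fake_now[0])
first = owner.bucket_for_insert("n1")
fake_now[0] = 30.0
second = owner.bucket_for_insert("n1")  # still within the ownership window
fake_now[0] = 61.0
third = owner.bucket_for_insert("n1")   # window expired, new partition
```

Bounding partitions by time rather than by an exact row count avoids a shared counter; the partition size then depends on the write rate during the window.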

Datamodel
How to make sure notes are not picked twice?
● Fetch the whole partition, not just one note
● A partition is "owned" by one client instance for a certain amount of time
● After that time it can be fetched again
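
The read-side ownership can be pictured as a lease with an expiry time. The sketch below keeps the leases in a local dictionary purely for illustration; the slides do not say how ownership is coordinated between client instances, so the class `PartitionLeases` and its methods are hypothetical, and a real deployment would need shared state.

```python
import time


class PartitionLeases:
    """Grants one client at a time the right to fetch a partition, until the lease expires."""

    def __init__(self, lease_seconds: float, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock
        self._leases = {}  # (notebook, bucket) -> (client id, expiry time)

    def try_claim(self, notebook: str, bucket: str, client_id: str) -> bool:
        now = self.clock()
        lease = self._leases.get((notebook, bucket))
        if lease is not None and lease[1] > now and lease[0] != client_id:
            return False  # another client still owns this partition
        self._leases[(notebook, bucket)] = (client_id, now + self.lease_seconds)
        return True


fake_now = [0.0]
leases = PartitionLeases(lease_seconds=30, clock=lambda: fake_now[0])
a_first = leases.try_claim("n1", "uuid1", "client-a")
b_blocked = leases.try_claim("n1", "uuid1", "client-b")
fake_now[0] = 31.0
b_later = leases.try_claim("n1", "uuid1", "client-b")
```

While a lease is active, only its holder fetches the partition, so notes in it are not handed to two clients at once; after expiry the partition becomes available again.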

Datamodel
Conclusion
● Scalable
● Partition sizes of around 1,000 notes per notebook bucket
● Fast writes
● Fast enough reads

Lessons learned
• Annoy your PO!
• Be sure about your data model before going into production!
• Do the math!
• Be aware of the problems caused by oversized partitions and tombstones!
• Delete partitions, not rows, when possible!
Questions?
Markus Höfer
IT Consultant
markus.hoefer@codecentric.de
www.codecentric.de
blog.codecentric.de/en
HashtagMarkus

Bucket your partitions wisely - Cassandra summit 2016


Editor's Notes

  1. 2.0 -> 100.000 rows and < 100 MB Now 2.X
  2. 10 minutes = 144 buckets per day
  3. Which strategy best fits our requirements?