Vadim Tkachenko
Percona
April’16
Percona Fractal Tree / TokuDB
Agenda
Why new data structure
Fractal Tree & LSM tree
Internals of Fractal Tree
When it is useful
How to use it
Why new data structure
Before: the B-Tree
• “Traditional” data structure
• In use since the 1970s
When B-Tree is good
• When the data size doesn’t exceed memory limits
• When the application is mostly performing read (SELECT)
operations, or when read performance is more important
than write performance
When B-Tree is not good
• As soon as the data size exceeds available memory,
performance drops rapidly
• Choosing flash-based storage helps performance, but only
to a certain extent: in the long run, memory limits cause
performance to suffer
To summarize
• B-tree was designed to provide optimal data retrieval
performance, but not data updates (insert, delete, update)
• This shortcoming created a need for data structures that
provide better write performance.
Cases when B-Tree is not optimal
• Accepting and storing event logs
• Storing measurements from a high-frequency sensor
• Tracking user clicks, and so on
• For such cases, two new data structures were created: log
structured merge (LSM) trees and Fractal Trees®.
LSM & Fractal Tree
LSM tree & Fractal tree
• Shift balance from optimal reads toward faster writes
Fractal Trees
• Invented around 2007
• Commercialized by Tokutek as the TokuDB engine
• 2015: became part of Percona
Fractal Tree
• Delay writes (send messages)
• Combine multiple delayed writes into a single IO
• => SELECTs have more work to do
  • They must walk through all pending messages
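The trade-off above can be illustrated with a toy model. This is only a sketch of the idea, not PerconaFT code: the class `ToyBufferedIndex`, its methods, and the `flush_threshold` parameter are hypothetical names invented for illustration.

```python
class ToyBufferedIndex:
    """Toy model: writes are queued as messages and applied in batches."""

    def __init__(self, flush_threshold=4):
        self.leaf = {}          # data already merged into the "leaf"
        self.messages = []      # delayed writes, oldest first
        self.flush_threshold = flush_threshold
        self.leaf_writes = 0    # how many times the leaf was rewritten

    def put(self, key, value):
        # A write only appends a message; the leaf is not touched here.
        self.messages.append((key, value))
        if len(self.messages) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Many delayed writes are applied with a single leaf rewrite.
        for key, value in self.messages:
            self.leaf[key] = value
        self.messages.clear()
        self.leaf_writes += 1

    def get(self, key):
        # A read has to check pending messages (newest first)
        # before it can trust what is in the leaf.
        for k, v in reversed(self.messages):
            if k == key:
                return v
        return self.leaf.get(key)


index = ToyBufferedIndex(flush_threshold=4)
for i in range(6):
    index.put(i, f"row-{i}")

print("leaf rewrites:", index.leaf_writes)
print("pending messages:", len(index.messages))
```

Six writes cost one leaf rewrite here, while every `get` pays for scanning whatever messages are still pending.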
Fractal tree benefits
• Tables that have a lot of indexes (preferably non-unique
indexes)
• Heavy write workload into the tables
• Systems with slow storage
• Saving space when the storage is fast but
expensive
From idea to reality
• Need concurrency-control mechanisms
• Need crash safety
• Need transactions, logging + recovery
• Need to support multithreading
• Need to integrate with the MySQL API layer
• Not everything is perfect yet
Fractal Tree Internals
On MySQL Level:
CREATE TABLE metrics (
  ts timestamp,
  device_id int,
  metric_id int,
  cnt int,
  val double,
  PRIMARY KEY (ts, device_id, metric_id),
  KEY metric_id (metric_id, ts),
  KEY device_id (device_id, ts)
) ENGINE=TokuDB;
Internally 3 trees
• Primary Key (ts, device_id, metric_id) => data
• Key (metric_id, ts) => PK (ts, device_id, metric_id)
• Key (device_id, ts) => PK (ts, device_id, metric_id)
• Note: a long PK adds overhead, because every secondary index entry carries the full PK
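A rough back-of-the-envelope sketch of that overhead follows. It counts bytes literally as the three mappings above are drawn; the column sizes are the usual MySQL storage sizes (4 bytes for TIMESTAMP and INT, 8 for DOUBLE), and all per-row and per-message overhead is ignored. A real engine may lay out secondary entries differently, so treat the numbers as illustrative only. The helper names are hypothetical.

```python
# Assumed storage sizes in bytes; engine overhead is ignored.
SIZES = {"ts": 4, "device_id": 4, "metric_id": 4, "cnt": 4, "val": 8}

PRIMARY_KEY = ("ts", "device_id", "metric_id")
SECONDARY_KEYS = {
    "metric_id": ("metric_id", "ts"),
    "device_id": ("device_id", "ts"),
}


def width(columns):
    """Total byte width of the given columns."""
    return sum(SIZES[c] for c in columns)


def entry_sizes():
    """Bytes per row in each tree: key columns plus what the key maps to."""
    sizes = {"PRIMARY": width(SIZES)}  # the PK tree holds the whole row
    for name, cols in SECONDARY_KEYS.items():
        # Secondary tree: key columns => full primary key.
        sizes[name] = width(cols) + width(PRIMARY_KEY)
    return sizes


sizes = entry_sizes()
for tree, size in sizes.items():
    print(f"{tree}: {size} bytes per row")
print("total across trees:", sum(sizes.values()))
```

Under these assumptions the 12-byte primary key is repeated in both secondary trees, so a shorter PK shrinks every index at once.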
Root Node
• F - tokudb_fanout (default 16)
• tokudb_block_size (default 4MiB)
Basement node (leaf)
• tokudb_read_block_size (default 64KiB)
• Chunk used for compression/decompression
• Smaller size is better for point lookups
Shape your tree (settings per TABLE)
• tokudb_block_size (default 4MiB)
  • size of a node in memory (on disk it will be compressed)
• tokudb_read_block_size (default 64KiB)
  • size of a basement node: the minimal read block size, and also the block size for
compression
  • Balance: a smaller tokudb_read_block_size is better for point reads, but
leads to more random IO
• tokudb_fanout (default 16) - defines the maximum number of
pivots per non-leaf node (number of pivots = tokudb_fanout-1)
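The arithmetic behind these three settings can be sketched as follows. This is an illustration of the relationships stated above, not TokuDB code; the function name `tree_shape` and its result keys are made up for the example.

```python
KIB = 1024
MIB = 1024 * KIB


def tree_shape(block_size=4 * MIB, read_block_size=64 * KIB, fanout=16):
    """Derive node-level numbers from the three per-table settings."""
    max_pivots = fanout - 1
    return {
        # An uncompressed leaf is split into basement nodes of read_block_size.
        "basement_nodes_per_leaf": block_size // read_block_size,
        "max_pivots": max_pivots,
        # A node with p pivots has p + 1 children.
        "max_children": max_pivots + 1,
    }


defaults = tree_shape()
ssd_point_reads = tree_shape(block_size=1 * MIB, read_block_size=16 * KIB)
print(defaults)
print(ssd_point_reads)
```

With the defaults a full leaf holds 64 basement nodes; shrinking both sizes by the same factor keeps that count while making each point read touch less data.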
Recommendations
tokudb_block_size:
A 4MiB block size is good for spinning disks.
For SSDs a smaller block size might be beneficial; I often use 1MiB.
In theory 64-128KiB should be even better, but TokuDB does not
handle such sizes properly (performance bug: linear search for a free
block in fragmented storage).
tokudb_read_block_size:
Recommended to set 16KiB if you expect point queries (again,
too bad this setting is per-table, not per-index).
How to see the shape of the tree
tokuftdump --summary
leaf nodes: 6797
non-leaf nodes: 97
Leaf size: 4,278,632,448
Total size: 4,286,052,352
Total uncompressed size: 6,231,518,882
Messages count: 70155
Messages size: 10,535,155
Records count: 30000000
Tree height: 2
height: 0, nodes count: 6797; avg children/node: 59.364131
basement nodes: 403498; msg size: 0; disksize: 4,278,632,448; uncompressed size: 6,220,381,082; ratio: 1.453825
height: 1, nodes count: 96; avg children/node: 70.802083
msg cnt: 65001; msg size: 9,756,907; disksize: 6,907,904; uncompressed size: 10,334,469; ratio: 1.496035
height: 2, nodes count: 1; avg children/node: 96.000000
msg cnt: 5154; msg size: 778,248; disksize: 512,000; uncompressed size: 803,331; ratio: 1.569006
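As a reading aid, the derived figures in this output can be recomputed from the raw counts. The sketch below only redoes the arithmetic on the numbers shown in the summary; the variable names are chosen for the example and are not tokuftdump fields.

```python
# Figures copied from the tokuftdump --summary output for height 0 (leaves).
leaf_nodes = 6797
basement_nodes = 403498
records = 30_000_000
leaf_disk_size = 4_278_632_448
leaf_uncompressed_size = 6_220_381_082

# "avg children/node" at height 0 is basement nodes per leaf.
basement_per_leaf = basement_nodes / leaf_nodes
# "ratio" is uncompressed size divided by size on disk.
compression_ratio = leaf_uncompressed_size / leaf_disk_size
# Not printed by the tool, but follow from the same counts.
records_per_basement = records / basement_nodes
avg_leaf_bytes_on_disk = leaf_disk_size / leaf_nodes

print(f"basement nodes per leaf: {basement_per_leaf:.6f}")
print(f"compression ratio:       {compression_ratio:.6f}")
print(f"records per basement:    {records_per_basement:.1f}")
print(f"avg leaf size on disk:   {avg_leaf_bytes_on_disk / 1024:.0f} KiB")
```

The first two values reproduce the 59.364131 and 1.453825 reported for height 0, which confirms how those columns are derived.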
FT properties
• “Delay writes” for as long as possible =>
  • writes are amortized into a single big write instead of N random writes
• May become a serious liability: a huge number of messages not yet merged into
leaf nodes
  • SELECT requires traversing all of those messages
  • Especially bad for point SELECT queries
• Remember: Primary Key or Unique Key constraints REQUIRE a
HIDDEN POINT SELECT lookup
  • UNIQUE KEY - performance killer for TokuDB
  • non-sequential PRIMARY KEY - performance killer for TokuDB
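Why a uniqueness constraint forces a read can be shown with a small toy, written here as a sketch. `ToyIndex` and its counters are hypothetical and stand in for any write-optimized index; nothing here is TokuDB's actual code path.

```python
class ToyIndex:
    """Counts the point lookups that inserts trigger."""

    def __init__(self, unique):
        self.unique = unique
        self.stored = []         # stands in for data on disk
        self.point_lookups = 0   # reads caused by inserts

    def insert(self, key):
        if self.unique:
            # The constraint cannot be verified without reading first.
            self.point_lookups += 1
            if key in self.stored:
                raise ValueError(f"duplicate key: {key}")
        # Without a uniqueness check the write is "blind": no read needed.
        self.stored.append(key)


plain = ToyIndex(unique=False)
uniq = ToyIndex(unique=True)
for key in range(1000):
    plain.insert(key)
    uniq.insert(key)

print("lookups for non-unique index:", plain.point_lookups)
print("lookups for unique index:    ", uniq.point_lookups)
```

Every insert into the unique index pays for one point lookup, which is exactly the operation the delayed-write design makes expensive.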
Implication of slow selects
• Unique keys - background checks - implicit reads
• Foreign keys - background checks (not supported in
TokuDB)
• Select by secondary index - requires two lookups (the index, then the primary key)
Covering indexes
• SELECT user_name
  FROM users
  WHERE user_email='sherlock@holmes.guru'
• Instead of INDEX (user_email) =>
• INDEX (user_email, user_name)
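The effect of the wider index can be observed in a query plan. The sketch below uses SQLite purely as a stand-in because it ships with Python; it is not TokuDB, and the index name `idx_email` and the `id` column are assumptions added for the example. In MySQL the equivalent signal is "Using index" in the Extra column of EXPLAIN.

```python
import sqlite3


def query_plan(index_columns):
    """Return SQLite's plan text for the lookup, given the index column list."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, user_email TEXT, user_name TEXT)"
    )
    conn.execute(f"CREATE INDEX idx_email ON users ({index_columns})")
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT user_name FROM users WHERE user_email = ?",
        ("sherlock@holmes.guru",),
    ).fetchall()
    conn.close()
    # The last column of each plan row is the human-readable detail.
    return " ".join(row[-1] for row in rows)


plain = query_plan("user_email")
covering = query_plan("user_email, user_name")
print("INDEX (user_email):            ", plain)
print("INDEX (user_email, user_name): ", covering)
```

With the two-column index the plan reports a covering index, meaning the query is answered from the index alone and the second lookup into the primary tree is avoided.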
When to use Fractal Tree?
• Table with many indexes (better if not UNIQUE) and intensive
writes into this table
• Slow storage
• Saving space on fast, expensive storage
• Less write amplification (good for SSD health)
• Cloud instances are often a good fit: storage is either slow, or
expensive when fast.
Benchmarks
Stories on PerconaFT internals
Eviction
• Algorithm to keep cached nodes within the memory limit
• tokudb_cache_size - amount of memory TokuDB allocates for
nodes in memory.
• TokuDB’s term is “CACHETABLE”; see the status variables:
  • show global status like '%CACHETABLE%';
• Eviction - background process that keeps memory consumption <=
tokudb_cache_size.
• It kicks in only when size_of(nodes_in_memory) > tokudb_cache_size
• TokuDB can therefore use more memory than tokudb_cache_size
• User threads are stalled if used memory > tokudb_cache_size*1.2
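The two thresholds above can be written down as a small decision function. This is a sketch of the rule as stated on the slide, not TokuDB code; `cache_state` and the state labels are names made up for the example.

```python
GIB = 1024 ** 3


def cache_state(used_bytes, cache_size_bytes, stall_factor=1.2):
    """Classify memory usage against tokudb_cache_size.

    stall_factor mirrors the 1.2 multiplier quoted above.
    """
    if used_bytes > cache_size_bytes * stall_factor:
        return "user threads stalled"
    if used_bytes > cache_size_bytes:
        return "evicting"
    return "idle"


cache_size = 8 * GIB
for used_gib in (7, 9, 10):
    print(f"{used_gib} GiB used -> {cache_state(used_gib * GIB, cache_size)}")
```

For an 8 GiB cache, eviction runs between 8 and 9.6 GiB of usage, and user threads are held back only beyond that.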
Eviction algorithm
The CACHETABLE uses the GCLOCK algorithm (not LRU) to manage nodes in memory.
Eviction algorithm in simple steps:
• If size_of(nodes_in_memory) > tokudb_cache_size:
  • Find a victim to remove from memory
  • The node with the smallest access_count is removed (evicted)
  • If the node is DIRTY, it is sent to a background process to be written to disk
• Tokudb_CACHETABLE_SIZE_WRITING - size of the nodes in the background write
queue
• Potential memory consumption is tokudb_cache_size +
Tokudb_CACHETABLE_SIZE_WRITING
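The simplified steps above can be sketched in a few lines. This follows the slide's description (pick the node with the smallest access count), not the real GCLOCK implementation in PerconaFT; the `Node` class and the `evict` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    size: int
    access_count: int
    dirty: bool = False


def evict(nodes, cache_size):
    """Evict least-accessed nodes until the in-memory set fits cache_size."""
    in_memory = list(nodes)
    write_queue = []  # dirty victims waiting for the background writer
    while in_memory and sum(n.size for n in in_memory) > cache_size:
        victim = min(in_memory, key=lambda n: n.access_count)
        in_memory.remove(victim)
        if victim.dirty:
            # Dirty nodes still occupy memory until they are written out.
            write_queue.append(victim)
    return in_memory, write_queue


nodes = [
    Node("a", size=4, access_count=10),
    Node("b", size=4, access_count=1, dirty=True),
    Node("c", size=4, access_count=5),
    Node("d", size=4, access_count=2),
]
in_memory, write_queue = evict(nodes, cache_size=8)
peak = sum(n.size for n in in_memory) + sum(n.size for n in write_queue)
print("kept:", [n.name for n in in_memory])
print("queued for write:", [n.name for n in write_queue])
print("memory still held:", peak)
```

The dirty victim stays in the write queue, so memory held is the cache size plus the queue, matching the "potential memory consumption" bullet.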
Partial eviction
• For clean (non-dirty) non-leaf nodes the evictor may choose to perform
partial eviction
• Two stages of partial eviction:
  • Compress part of the node
  • If still unused, remove it from memory
• Variables that control this:
  • tokudb_enable_partial_eviction
  • tokudb_compress_buffers_before_eviction
TokuDB Compression
• Only uncompressed data is stored in memory (except for
partially compressed parts of non-leaf nodes).
• It seems beneficial to use the OS cache as a secondary cache
for compressed nodes. For this:
  • tokudb_directio=OFF
  • Use cgroups to limit total memory usage by the mysqld process
Checkpointing
• Checkpointing is the periodic process that brings data files in sync with the transactional
redo log files.
• show global status like '%CHECKPOINT%';
• In TokuDB checkpointing is time-based; in InnoDB it is based on log file size.
• In InnoDB checkpointing is fuzzy. In TokuDB it starts on a timer and runs until it is done.
• Checkpointing interval in TokuDB:
  • tokudb_checkpointing_period=N sec
Checkpoint algorithm
• START CHECKPOINT;
  • begin_checkpoint; <- all transactions are stalled
  • mark all nodes in memory as PENDING;
  • end_begin_checkpoint;
• Checkpoint thread: go through all PENDING nodes; if a node is dirty, write it to disk
• User threads: if a user query hits a PENDING node, the node is CLONED and handed to the background
checkpoint thread pool
• By default the checkpoint thread pool size (number of threads) = CPU cores / 4.
  • That is 4 threads on a 16-core server.
  • In a CPU-bound workload this takes 25% of CPU power away from user threads!
• Variable: tokudb_checkpoint_pool_threads=N
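The sizing rule above is simple arithmetic, sketched below. The function names are invented for the example, and the floor of one thread for machines with fewer than four cores is an assumption of this sketch, not something the slides state.

```python
def default_checkpoint_threads(cpu_cores):
    """Default pool size per the rule above: CPU cores / 4.

    Assumption of this sketch: at least one thread on small machines.
    """
    return max(1, cpu_cores // 4)


def cpu_share_taken(cpu_cores):
    """Fraction of cores the pool can occupy in a CPU-bound workload."""
    return default_checkpoint_threads(cpu_cores) / cpu_cores


for cores in (8, 16, 32):
    threads = default_checkpoint_threads(cores)
    share = cpu_share_taken(cores)
    print(f"{cores} cores -> {threads} checkpoint threads ({share:.0%} of CPU)")
```

On CPU-bound servers this is the number to compare against before lowering tokudb_checkpoint_pool_threads.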
Few words on LSM
LSM tree
• Older than Fractal Tree
• Google BigTable was the primary driver of interest
• Cassandra
• RocksDB
• MongoRocks
• MyRocks
Instead of final summary
• Alternative data structures have their place
• Use wisely, know limitations
• A lot of work ahead
Thank you!
• My contact
• Vadim@percona.com
• @VadimTK
Top 5 Best Data Analytics Courses In Queens
 

Percona FT / TokuDB

  • 2. Agenda • Why a new data structure • Fractal Tree & LSM tree • Internals of Fractal Tree • When it is useful • How to use it
  • 3. Why a new data structure
  • 4. Before: the B-Tree • The “traditional” data structure • In use since the 1970s
  • 6. When B-Tree is good • When the data size doesn’t exceed available memory • When the application mostly performs read (SELECT) operations, or when read performance is more important than write performance
  • 7. When B-Tree is not good • As soon as the data size exceeds available memory, performance drops rapidly • Choosing flash-based storage helps performance, but only to a certain extent; in the long run, memory limits cause performance to suffer
  • 8. To summarize • The B-Tree was designed to provide optimal data retrieval performance, but not optimal data updates (insert, delete, update) • This shortcoming created a need for data structures that provide better performance for storing (writing) data
  • 9. Cases when B-Tree is not optimal • accepting and storing event logs • storing measurements from a high-frequency sensor • tracking user clicks, and so on • For such cases, two new data structures were created: log-structured merge (LSM) trees and Fractal Trees®
  • 11. LSM tree & Fractal tree • Shift the balance from optimal reads toward faster writes
  • 13. Fractal Trees • Invented around 2007 • Tokutek and TokuDB as the commercial engine • 2015: part of Percona
  • 14. Fractal Tree • Delay writes (send them as messages) • Combine multiple delayed writes into a single I/O • => SELECTs have more work to do: they must walk through all pending messages
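The delayed-write idea on slide 14 can be sketched in a few lines of Python. This is only a toy under stated assumptions: `BufferedNode`, `buffer_limit`, and the dict standing in for leaf data are hypothetical names for illustration, not PerconaFT structures. It shows several writes being applied in one batch, and a lookup having to check the pending messages before the leaf.

```python
class BufferedNode:
    """Toy node that buffers write messages and applies them in batches."""

    def __init__(self, buffer_limit=4):
        self.buffer_limit = buffer_limit  # hypothetical flush threshold
        self.messages = []                # pending (key, value) messages, oldest first
        self.leaf = {}                    # stands in for leaf data on disk
        self.flushes = 0                  # number of simulated batched I/Os

    def put(self, key, value):
        # A write only appends a message; no leaf I/O happens yet.
        self.messages.append((key, value))
        if len(self.messages) >= self.buffer_limit:
            self.flush()

    def flush(self):
        # Many delayed writes are applied to the leaf in one batch.
        for key, value in self.messages:
            self.leaf[key] = value
        self.messages.clear()
        self.flushes += 1

    def get(self, key):
        # A read must walk the pending messages first (newest wins),
        # and only then fall back to the leaf.
        for k, v in reversed(self.messages):
            if k == key:
                return v
        return self.leaf.get(key)


node = BufferedNode(buffer_limit=4)
for i in range(10):
    node.put(i, i * i)

print(node.flushes, len(node.messages), node.get(9), node.get(2))
```

Ten writes cost only a couple of batched flushes here, while every `get` pays for scanning whatever is still buffered, which is the trade-off the slide describes.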
  • 16. Fractal tree benefits • Tables that have a lot of indexes (preferably non-unique indexes) • Heavy write workload into the tables • Systems with slow storage • Saving space when the storage is fast but expensive
  • 17. From idea to reality • Need concurrency-control mechanisms • Need crash safety • Need transactions, logging + recovery • Need to support multithreading • Need to integrate with the MySQL API layer • Not everything is perfect yet
  • 19. On the MySQL level: CREATE TABLE metrics ( ts timestamp, device_id int, metric_id int, cnt int, val double, PRIMARY KEY (ts, device_id, metric_id), KEY metric_id (metric_id, ts), KEY device_id (device_id, ts) )
  • 20. Internally, 3 trees • Primary Key (ts, device_id, metric_id) => data • Key (metric_id, ts) => PK (ts, device_id, metric_id) • Key (device_id, ts) => PK (ts, device_id, metric_id) • Note: a long PK adds overhead, because every secondary key entry carries the full PK
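To make the "3 trees" picture concrete, here is a small Python sketch. It is a toy model, not the engine's storage layout: plain dicts and lists stand in for the trees, and `make_trees`, `insert_row`, and `select_by_metric` are hypothetical helper names. It shows that one row insert touches all three trees, and that a select through a secondary key needs a second lookup in the primary tree.

```python
def make_trees():
    # One primary tree plus one tree per secondary key, as on the slide.
    return {"primary": {}, "by_metric": [], "by_device": []}


def insert_row(trees, ts, device_id, metric_id, cnt, val):
    pk = (ts, device_id, metric_id)
    trees["primary"][pk] = (cnt, val)                  # PK => data
    trees["by_metric"].append(((metric_id, ts), pk))   # key => full PK
    trees["by_device"].append(((device_id, ts), pk))   # key => full PK
    return 3  # number of trees written for a single row


def select_by_metric(trees, metric_id):
    rows = []
    for key, pk in sorted(trees["by_metric"]):
        if key[0] == metric_id:               # lookup 1: secondary tree gives the PK
            rows.append((pk, trees["primary"][pk]))  # lookup 2: primary tree gives the data
    return rows


trees = make_trees()
writes = insert_row(trees, "2016-04-01 00:00:00", 7, 3, 1, 0.5)
writes += insert_row(trees, "2016-04-01 00:00:01", 8, 4, 2, 1.5)

print(writes, select_by_metric(trees, 3))
```

Each secondary entry stores the whole `(ts, device_id, metric_id)` tuple, which is why a long primary key inflates every index.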
  • 21. Root node • F = tokudb_fanout (default 16) • tokudb_block_size (default 4MiB)
  • 23. tokudb_read_block_size (default 64KiB) • Chunk used for compression/decompression • A smaller size is better for point lookups
  • 24. Shape your tree (settings per TABLE) • tokudb_block_size (default 4MiB): size of a node in memory (on disk it is compressed) • tokudb_read_block_size (default 64KiB): size of a basement node, which is the minimal read block size and also the block size for compression • Balance: a smaller tokudb_read_block_size is better for point reads, but leads to more random I/O • tokudb_fanout (default 16): defines the maximum number of pivots per non-leaf node (number of pivots = tokudb_fanout - 1)
  • 25. Recommendations for tokudb_block_size • The 4MiB block size is good for spinning disks • For SSD a smaller block size might be beneficial; I often use 1MiB • In reality 64-128KiB should be even better, but TokuDB does not handle these sizes properly (performance bug: linear search for a free block in fragmented storage)
  • 26. Recommendations for tokudb_read_block_size • Set it to 16KiB if you expect point queries (again, too bad this setting is per table, not per index)
  • 27. How to see the shape of the tree • tokuftdump --summary
  • 29. tokuftdump --summary
    leaf nodes: 6797
    non-leaf nodes: 97
    Leaf size: 4,278,632,448
    Total size: 4,286,052,352
    Total uncompressed size: 6,231,518,882
    Messages count: 70155
    Messages size: 10,535,155
    Records count: 30000000
    Tree height: 2
    height: 0, nodes count: 6797; avg children/node: 59.364131
      basement nodes: 403498; msg size: 0; disksize: 4,278,632,448; uncompressed size: 6,220,381,082; ratio: 1.453825
    height: 1, nodes count: 96; avg children/node: 70.802083
      msg cnt: 65001; msg size: 9,756,907; disksize: 6,907,904; uncompressed size: 10,334,469; ratio: 1.496035
    height: 2, nodes count: 1; avg children/node: 96.000000
      msg cnt: 5154; msg size: 778,248; disksize: 512,000; uncompressed size: 803,331; ratio: 1.569006
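The derived figures in this summary can be recomputed from the raw counts, which helps when reading the output. The Python below only redoes arithmetic on numbers copied from the slide; the variable names are mine, and the "64 basement nodes per leaf" upper bound assumes the table was created with the default tokudb_block_size (4MiB) and tokudb_read_block_size (64KiB), which the slide does not state.

```python
# Values copied from the tokuftdump --summary output above.
leaf_nodes = 6797
height1_nodes = 96
height2_nodes = 1
basement_nodes = 403498
leaf_disksize = 4_278_632_448
leaf_uncompressed = 6_220_381_082
msg_cnt_by_height = [65001, 5154]
msg_size_by_height = [9_756_907, 778_248]

# "avg children/node" at height 0 is basement nodes per leaf node.
basement_per_leaf = basement_nodes / leaf_nodes
# "avg children/node" at height 1 is leaf nodes per height-1 node.
leaves_per_height1 = leaf_nodes / height1_nodes
# "ratio" is uncompressed size divided by on-disk (compressed) size.
leaf_ratio = leaf_uncompressed / leaf_disksize

# Assumption: default settings, so a full leaf holds at most this many basement nodes.
max_basement_per_leaf = (4 * 1024 * 1024) // (64 * 1024)

non_leaf_nodes = height1_nodes + height2_nodes
total_messages = sum(msg_cnt_by_height)
total_message_size = sum(msg_size_by_height)

print(f"basement nodes per leaf: {basement_per_leaf:.6f} (max {max_basement_per_leaf})")
print(f"leaves per height-1 node: {leaves_per_height1:.6f}")
print(f"leaf compression ratio: {leaf_ratio:.6f}")
print(f"non-leaf nodes: {non_leaf_nodes}, messages: {total_messages}, message bytes: {total_message_size}")
```

The per-height message counts and sizes add up to the "Messages count" and "Messages size" totals, and all buffered messages sit in the non-leaf levels.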
  • 30. FT properties • “Delay writes” for as long as possible => writes are amortized into a single big write instead of N random writes • This can become a serious liability: a huge number of messages not yet merged into the leaf nodes • A SELECT has to traverse all those messages • Especially bad for point SELECT queries • Remember: Primary Key or Unique Key constraints REQUIRE a hidden point SELECT lookup • UNIQUE KEY: a performance killer for TokuDB • Non-sequential PRIMARY KEY: a performance killer for TokuDB
  • 31. Implications of slow selects • Unique keys: background checks, i.e. implicit reads • Foreign keys: background checks (not supported in TokuDB) • Select by index: requires two lookups
  • 32. Covering indexes • SELECT user_name FROM users WHERE user_email='sherlock@holmes.guru'; • Instead of INDEX (user_email), use INDEX (user_email, user_name)
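The effect of the covering index can be observed with any SQL engine that exposes its query plan. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in, so it demonstrates the general technique and not TokuDB behavior; the table layout beyond the two columns on the slide, the index names, and the `plan` helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, user_email TEXT, user_name TEXT)"
)
conn.execute(
    "INSERT INTO users (user_email, user_name) VALUES (?, ?)",
    ("sherlock@holmes.guru", "Sherlock"),
)

query = "SELECT user_name FROM users WHERE user_email = ?"
params = ("sherlock@holmes.guru",)


def plan(connection, sql, args):
    """Return the textual query plan; the detail text is the last column of each row."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " ".join(row[-1] for row in rows)


# Index on the filter column only: the engine still has to fetch user_name from the table.
conn.execute("CREATE INDEX idx_email ON users (user_email)")
plan_single = plan(conn, query, params)

# Index that also contains the selected column: the query is answered from the index alone.
conn.execute("DROP INDEX idx_email")
conn.execute("CREATE INDEX idx_email_name ON users (user_email, user_name)")
plan_covering = plan(conn, query, params)

print(plan_single)
print(plan_covering)
```

With the two-column index the plan reports a covering index, meaning the second lookup into the primary tree is avoided, which is exactly the lookup that is expensive in a Fractal Tree.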
  • 33. When to use Fractal Tree? • Tables with many indexes (better if not UNIQUE) and intensive writes • Slow storage • Saving space on fast, expensive storage • Less write amplification (good for SSD health) • Cloud instances are often a good fit: storage is either slow, or expensive when fast
  • 37. Stories on PerconaFT internals
  • 38. Eviction • Algorithm that keeps the cached nodes within the memory limit
  • 39. Eviction • tokudb_cache_size: amount of memory TokuDB allocates for nodes in memory • TokuDB’s term is “CACHETABLE”; see the status variables: SHOW GLOBAL STATUS LIKE '%CACHETABLE%'; • Eviction: a background process that keeps memory consumption <= tokudb_cache_size • It kicks in only when size_of(nodes_in_memory) > tokudb_cache_size • So TokuDB can use more memory than tokudb_cache_size • User threads are stalled if used memory > tokudb_cache_size*1.2
  • 40. Eviction algorithm • CACHETABLE uses the GCLOCK algorithm (not LRU) to manage nodes in memory • Eviction in simple steps: if size_of(nodes_in_memory) > tokudb_cache_size, find a victim to remove from memory; the node with the smallest access_count is evicted; if the node is DIRTY, it is sent to a background process to be written to disk • Tokudb_CACHETABLE_SIZE_WRITING: size of the nodes in the background write queue • Potential memory consumption is tokudb_cache_size + Tokudb_CACHETABLE_SIZE_WRITING
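The eviction steps above can be sketched as a simplified clock-style sweep in Python. This is an illustration of the general GCLOCK idea under my own assumptions, not PerconaFT's code: `ToyCachetable`, the per-node dict fields, and the "decrement the count as the hand passes" rule are hypothetical simplifications, and the background writer is reduced to a list.

```python
from collections import OrderedDict


class ToyCachetable:
    """Simplified clock-style cache: frequently touched nodes survive a sweep longer."""

    def __init__(self, cache_size):
        self.cache_size = cache_size
        self.nodes = OrderedDict()   # name -> {"size", "count", "dirty"}; front = clock hand
        self.write_queue = []        # dirty victims handed to a (simulated) background writer

    def used(self):
        return sum(node["size"] for node in self.nodes.values())

    def touch(self, name, size, dirty=False):
        node = self.nodes.setdefault(name, {"size": size, "count": 0, "dirty": False})
        node["count"] += 1
        node["dirty"] = node["dirty"] or dirty
        self.evict()

    def evict(self):
        # Eviction runs only while the cache is over its limit.
        while self.used() > self.cache_size and self.nodes:
            name, node = next(iter(self.nodes.items()))
            if node["count"] > 0:
                node["count"] -= 1            # give the node another chance
                self.nodes.move_to_end(name)  # the clock hand moves on
            else:
                del self.nodes[name]          # victim found
                if node["dirty"]:
                    self.write_queue.append(name)


cache = ToyCachetable(cache_size=100)
cache.touch("a", 40)
cache.touch("b", 40, dirty=True)
cache.touch("a", 40)   # "a" is accessed twice
cache.touch("c", 40)   # pushes usage to 120, so eviction starts

print(list(cache.nodes), cache.write_queue, cache.used())
```

In this run the less frequently accessed dirty node is chosen as the victim and queued for writing. Its memory is only really freed once the write finishes, which is why the slide adds Tokudb_CACHETABLE_SIZE_WRITING to the potential memory consumption.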
  • 41. Partial eviction • For non-leaf, non-dirty nodes the evictor may choose to perform partial eviction • Two stages of partial eviction: • Compress a part of the node • If it is still not used, remove it from memory • Variables that control this: • tokudb_enable_partial_eviction • tokudb_compress_buffers_before_eviction
  • 44. TokuDB Compression • Only uncompressed data is stored in memory (except for partially compressed parts of non-leaf nodes) • It seems beneficial to use the OS cache as a secondary cache for compressed nodes; for this: • tokudb_directio=OFF • use cgroups to limit the total memory usage of the mysqld process
  • 45. Checkpointing • Checkpointing is the periodic process that brings the data files in sync with the transactional redo log files • SHOW GLOBAL STATUS LIKE '%CHECKPOINT%'; • In TokuDB checkpointing is time-based; in InnoDB it is based on log file size • In InnoDB checkpointing is fuzzy; in TokuDB it starts on a timer and runs until it is done • Checkpointing interval in TokuDB: tokudb_checkpointing_period=N (seconds)
  • 48. Checkpoint algorithm • START CHECKPOINT; • begin_checkpoint; <- all transactions are stalled • mark all nodes in memory as PENDING; • end_begin_checkpoint; • Checkpoint thread: go through all PENDING nodes; if dirty, write to disk • User threads: if a user query hits a PENDING node, the node is CLONED and handed to the background checkpoint thread pool • By default the checkpoint thread pool size (number of threads) = CPU cores / 4, i.e. 4 threads on a 16-core server • In a CPU-bound workload this takes 25% of CPU power away from user threads! • Variable: tokudb_checkpoint_pool_threads=N
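The checkpoint steps can be walked through with a small Python model. It is a sketch under stated assumptions: `ToyCheckpoint`, `user_write`, and the dict-based nodes are hypothetical; the thread pool is simulated by writing the clone immediately; and the `max(1, ...)` floor in `checkpoint_pool_threads` is my assumption, since the slide only gives the "CPU cores / 4" rule.

```python
def checkpoint_pool_threads(cpu_cores):
    # Slide rule: pool size = CPU cores / 4. The floor of 1 is an assumption.
    return max(1, cpu_cores // 4)


class ToyCheckpoint:
    def __init__(self, nodes):
        self.nodes = nodes   # name -> {"data", "dirty", "pending"}
        self.written = {}    # name -> data captured by the checkpoint ("disk")

    def begin(self):
        # begin_checkpoint: every in-memory node is marked PENDING.
        for node in self.nodes.values():
            node["pending"] = True

    def user_write(self, name, data):
        node = self.nodes[name]
        if node["pending"]:
            # The user thread hits a PENDING node: its current state is cloned for
            # the checkpoint (written right away here instead of via a thread pool).
            if node["dirty"]:
                self.written[name] = node["data"]
            node["pending"] = False
        node["data"] = data
        node["dirty"] = True  # the new change belongs to the next checkpoint

    def run(self):
        # Checkpoint thread: visit the remaining PENDING nodes, write the dirty ones.
        for name, node in self.nodes.items():
            if node["pending"]:
                if node["dirty"]:
                    self.written[name] = node["data"]
                    node["dirty"] = False
                node["pending"] = False


nodes = {
    "a": {"data": "a1", "dirty": True, "pending": False},
    "b": {"data": "b1", "dirty": False, "pending": False},
    "c": {"data": "c1", "dirty": True, "pending": False},
}
cp = ToyCheckpoint(nodes)
cp.begin()
cp.user_write("a", "a2")   # arrives while the checkpoint is in progress
cp.run()

print(cp.written, nodes["a"], checkpoint_pool_threads(16))
```

The checkpoint captures the state as of `begin()`: the pre-write version of the node modified mid-checkpoint is what lands on disk, while the newer change stays dirty for the next round.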
  • 50. A few words on LSM
  • 51. LSM tree • Older than Fractal Tree • Google BigTable as the primary driver of interest • Cassandra • RocksDB • MongoRocks • MyRocks
  • 52. Instead of a final summary • Alternative data structures have their place • Use them wisely, know their limitations • A lot of work ahead
  • 53. Thank you! • My contact • Vadim@percona.com • @VadimTK