MariaDB
ColumnStore
Serge Frezefond
Cloud Solution Architect
Agenda
Session 1
Overview
Architecture
Columnar, Relational vs Columnar
InfiniDB Columnar Data Storage
Session 2
Installation
Operations
Session 3
Massively Parallel Query Execution
ETL
Overview Architecture
Columnar
Distributed
Storage
MariaDB ColumnStore
• GPLv2 Open Source
• Columnar, Massively Parallel
MariaDB Storage Engine
• Scalable, high-performance
analytics platform
• Built in redundancy and
high availability
• Runs on premise, on AWS cloud
• Full SQL syntax and capabilities
regardless of platform
[Diagram: Big Data Sources; ELT Tools; MariaDB ColumnStore (Node 1, Node 2, Node 3 ... Node N; storage on Local / AWS® / GlusterFS®); BI Tools; Analytics; Insight]
MariaDB ColumnStore Architecture
[Diagram: User Connections; User Module 1 ... User Module n; Performance Module 1, Performance Module 2 ... Performance Module n; Columnar Distributed Data Storage]
User Module: MariaDB Front End, Query Engine. Processes SQL Requests
Performance Module: Distributed Processing Engine
MariaDB ColumnStore
High-performance columnar storage engine that supports a wide variety of
analytical use cases with SQL in highly scalable distributed environments
• Faster, More Efficient Queries: parallel query processing for distributed environments
• Easier Enterprise Analytics: single SQL interface for OLTP and analytics
• Better Price Performance: power of SQL and freedom of open source for Big Data analytics
Workload – Query Vision/Scope
[Chart: axis ticks 1; 100; 10,000; 1,000,000; 100,000,000; 10,000,000,000 against data sizes 10-100GB; 100-1,000GB; 1-10TB, ranging from OLTP/NoSQL Workloads to OLAP/Analytic/Reporting Workloads]
Suited for reporting or analysis of millions to billions of rows from data sets containing millions to trillions of rows.
MariaDB ColumnStore
MariaDB Functions
• MariaDB Client
• MariaDB Connectivity (JDBC, ODBC)
• MariaDB Security
• Initial SQL Statement Parsing
• Initial SQL Optimization < Custom Handler Class >
• Execute final sort and final limit
• Display final results
ExeMgr Functions
• SQL Optimization
• Distribute work for scan, filter, join, functions,
expressions, group by, aggregation, etc. to all available
Performance Modules to be run in parallel
• Collect the results returned by the Performance Modules
• Return the final results to MySQL for display
MariaDB
ColumnStore
ExeMgr
[Diagram: User Connections; User Module 1 ... User Module n; MPP across Performance Module 1, Performance Module 2 ... Performance Module n; Columnar Distributed Data Storage]
User Module: MySQL Front End. Processes SQL Requests
Performance Module: Distributed Processing Engine. Executes the Queries
MariaDB ColumnStore
MariaDB ColumnStore
uses standard
“Engine=columnstore”
syntax
mysql> use tpcds_djoshi
Database changed
mysql> select count(*) from store_sales;
+----------+
| count(*) |
+----------+
| 2880404 |
+----------+
1 row in set (1.68 sec)
mysql> describe warehouse;
+-------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| w_warehouse_sk | int(11) | NO | | NULL | |
| w_warehouse_id | char(16) | NO | | NULL | |
| w_warehouse_name | varchar(20) | YES | | NULL | |
| w_warehouse_sq_ft | int(11) | YES | | NULL | |
| w_street_number | char(10) | YES | | NULL | |
| w_street_name | varchar(60) | YES | | NULL | |
| w_street_type | char(15) | YES | | NULL | |
| w_suite_number | char(10) | YES | | NULL | |
| w_city | varchar(60) | YES | | NULL | |
| w_county | varchar(30) | YES | | NULL | |
| w_state | char(2) | YES | | NULL | |
| w_zip | char(10) | YES | | NULL | |
| w_country | varchar(20) | YES | | NULL | |
| w_gmt_offset | decimal(5,2) | YES | | NULL | |
+-------------------+--------------+------+-----+---------+-------+
14 rows in set (0.05 sec)
CREATE TABLE `game_warehouse`.`dim_title` (
`id` INT,
`name` VARCHAR(45),
`publisher` VARCHAR(45),
`release_date` DATE,
`language` INT,
`platform_name` VARCHAR(45),
`version` VARCHAR(45)
) ENGINE=columnstore;
Uses custom scalable
columnar architecture
MariaDB ColumnStore
MariaDB Front End
Standard ANSI SQL
User Module at a Glance
Process | Functionality | Value
MariaDB | Hosts MariaDB; Connection management; SQL parsing & optimization | Familiar DBMS interface; Leverages existing partner integrations; Delivers rich SQL syntax support
Extent Map | Abstracts physical and logical storage; Metadata store | Enables partition elimination
ExeMgr | Work distribution; Final results management and aggregation | Multi-threaded to take advantage of multi-core HW platforms
Performance Module at a Glance
Process | Functionality | Value
PrimProc | Scale-out cache management; Distributed scan, filter, join and aggregation operations; Resource management | Independent scalability and tunable performance; Multi-threaded to take advantage of multi-core HW platforms
Data | High Speed Bulk Load; Transactional DML and DDL; Online schema extensions | Non-blocking read enabled; Multi-threaded to take advantage of multi-core HW platforms
Columnar General Best Practices
Not suited for OLTP
Micro-batch load allows for near real-time behavior
Infrequently used columns do not impact other queries
Columnar suitable for sparse columns (nulls compress nicely)
Data Modeling Best Practices
Star-schema optimizations are generally a good idea
Conservative data typing is very important
Especially around fixed-length vs. dictionary boundary (8 bytes)
IP Address vs. IP Number
Break down compound fields into individual fields:
Trivializes searching for sub-fields
Can avoid dictionary overhead
Cost to re-assemble is generally small
Compression with Data Storage Layer
Logical Layer: Blocks (8KB); Extent 1 (8MB~64MB, 8 million rows)
Physical Layer: Segment File 1 (maps to an Extent); Compression Chunks
Data Load and Extents (local load)
1st Data Load: CSV File, Data Range 1 ~ 200, Rows 16 million
• Extent 1: Min 1, Max 200
• Extent 2: Min 1, Max 200
2nd Data Load: New CSV File, Data Range 150 ~ 210, Rows 16 million +8
• Extent 3: Min 150, Max 210
• Extent 4: Min 150, Max 210
• Extent 5: Min 150, Max 210
Each extent is labelled 8 million rows.
Extent Map
Key meta-structure that powers MariaDB ColumnStore’s performance
A catalog of all extents
• Minimum and maximum values for a column’s data within an extent
• Corresponding blocks for each extent
Master copy of the Extent Map on primary PM node
Upon system startup, copied to all other UM and PM nodes for disaster recovery and failover purposes
Extent Map resident in memory for quick access at all nodes
As extents are modified, updates are broadcast to all participating nodes
Stores about 64 bytes for each 8-64 MB on disk
Extent Map
When performing queries:
• Consider only the extents of the columns used in join and filter conditions
• Use the minimum and maximum values of those extents to skip every extent that cannot contain matching rows
Multiple columns can be used together for partition elimination
Transitive properties apply, i.e. a filter on a dimension column (date, for example)
can allow for partition elimination on the fact table
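A minimal sketch of extent elimination, written in Python for illustration only. It assumes the Extent Map can be modelled as a list of (id, min, max) entries for a single column; the `Extent` class and `extents_to_scan` function are hypothetical names, not ColumnStore APIs. The min/max values are the ones shown on the "Data Load and Extents" slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Hypothetical Extent Map entry for one column: id plus min/max values."""
    extent_id: int
    min_value: int
    max_value: int


# Min/max values as listed on the "Data Load and Extents" slide.
EXTENT_MAP = [
    Extent(1, 1, 200),
    Extent(2, 1, 200),
    Extent(3, 150, 210),
    Extent(4, 150, 210),
    Extent(5, 150, 210),
]


def extents_to_scan(extent_map, low, high):
    """Return ids of extents whose [min, max] range overlaps the filter [low, high].

    Extents that do not overlap cannot hold matching rows and are skipped.
    """
    return [
        e.extent_id
        for e in extent_map
        if e.max_value >= low and e.min_value <= high
    ]


print(extents_to_scan(EXTENT_MAP, 100, 100))  # col = 100             -> [1, 2]
print(extents_to_scan(EXTENT_MAP, 205, 210))  # col BETWEEN 205 AND 210 -> [3, 4, 5]
print(extents_to_scan(EXTENT_MAP, 160, 180))  # overlaps every extent  -> [1, 2, 3, 4, 5]
```

The third call shows the limit of the technique: when loads overlap in value range, a filter inside the overlap cannot eliminate any extent.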
Data Types
At the physical layer, all columns are stored as:
• 1-byte Field with 8192 values per 8k block. Examples: TinyInt, Char(1)
• 2-byte Field with 4096 values per 8k block. Examples: SmallInt, Char(2)
• 4-byte Field with 2048 values per 8k block. Examples: Int, Char(3), Char(4), date, float
• 8-byte Field with 1024 values per 8k block. Examples: BigInt, Char(5-8), datetime, real/double
• Dictionary structure made up of 2 files/extents with:
  - 8-byte fixed length token (pointer)
  - A variable length value stored at the location identified by the pointer
  Dictionary Examples: Varchar(8) or larger, Char(9) or larger
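The per-block counts on the Data Types slides follow directly from dividing the 8k block by the field width. The sketch below reproduces them and, as an assumption, reads "8 million rows" per extent as 8 x 1024 x 1024 rows, which gives the 8MB~64MB extent range quoted on the storage slide. Function and constant names are illustrative.

```python
BLOCK_SIZE_BYTES = 8 * 1024        # 8k block, as stated on the slide
ROWS_PER_EXTENT = 8 * 1024 * 1024  # assumption: "8 million rows" read as 8 x 1024 x 1024


def values_per_block(field_width_bytes: int) -> int:
    """Number of fixed-width values that fit in one 8k block."""
    return BLOCK_SIZE_BYTES // field_width_bytes


def extent_size_mb(field_width_bytes: int) -> int:
    """Uncompressed size in MB of one extent for a fixed-width column."""
    return ROWS_PER_EXTENT * field_width_bytes // (1024 * 1024)


for width in (1, 2, 4, 8):
    print(f"{width}-byte field: {values_per_block(width)} values per block, "
          f"{extent_size_mb(width)} MB per extent")
```

This is also why conservative data typing matters: halving the field width doubles the values read per block.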
Sizing
Minimum Spec
UM: 4 core, 32 G RAM
PM: 4 core, 16 G RAM
Typical Server spec
UM: 8 core, 264G RAM
PM: 8 core 64G RAM
Data Storage
External Data Volumes
• Maximum 2 data volumes per IO channel per PM node server
• Up to 2TB on disk per data volume ≈ Max 4 TB per PM node
Local disk
• Up to 2TB on disk per PM node server
DETAILED SIZING GUIDE
based on data size
and workload
Sizing - Example
• MariaDB ColumnStore 60TB uncompressed data = 6TB compressed data at 10x compression
• 2 UMs - 8 core, 512G (based on workload)
• 6 TB compressed = 3 data volumes (at 2TB per volume)
  - with 1 data volume per PM node - 3 PMs
• Data growth - 2TB per month, Data retention - 2 years
  - Plan for 2TB x 24 = 48 TB additional
  - 48 TB = 4.8TB compressed ≈ 3 data volumes (at 2TB per volume),
    with 1 data volume per PM node - 3 additional PMs
• Total 6 PMs, 2 UMs
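The sizing arithmetic above can be checked with a short Python sketch. The defaults (10x compression, 2TB per volume, 1 volume per PM) are the figures used in this example, not general recommendations, and `pm_nodes_needed` is a hypothetical helper name.

```python
import math


def pm_nodes_needed(uncompressed_tb, compression_ratio=10, volume_tb=2, volumes_per_pm=1):
    """PM nodes required to hold the data, rounding up to whole volumes and nodes."""
    compressed_tb = uncompressed_tb / compression_ratio
    volumes = math.ceil(compressed_tb / volume_tb)
    return math.ceil(volumes / volumes_per_pm)


initial_pms = pm_nodes_needed(60)     # 60 TB -> 6 TB compressed -> 3 volumes
growth_pms = pm_nodes_needed(2 * 24)  # 2 TB/month for 24 months -> 4.8 TB compressed
print(initial_pms, growth_pms, initial_pms + growth_pms)  # 3 3 6
```

Rounding up is what turns 4.8TB of compressed growth into a third 2TB volume rather than 2.4 volumes.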
Analytics with
MariaDB
ColumnStore
SQL Features
Aggregation
Windowing Functions
UDF
Tuning
Commands
ETL
SQL Features
Source : InfiniDB SQL Syntax Guide
Cross Engine
Joins
UDF
DML
Aggregation
DDL
Disk Based
Joins
Windowing
Functions
SELECT
QUERY
MAX RANK
MIN DENSE_RANK
COUNT PERCENT_RANK
SUM NTH_VALUE
AVG FIRST_VALUE
VARIANCE LAST_VALUE
VAR_POP CUME_DIST
VAR_SAMP LAG
STD LEAD
STDDEV NTILE
STDDEV_POP PERCENTILE_CONT
STDDEV_SAMP PERCENTILE_DISC
ROW_NUMBER MEDIAN
• Aggregate over a series of related rows
• Simplified function for complex statistical
analytics over sliding window per row
- Cumulative, moving or centered aggregates
- Simple Statistical functions like rank, max, min,
average, median
- More complex functions such as distribution,
percentile, lag, lead
- Without running complex sub-queries
Windowing Functions
Source : InfiniDB SQL Syntax Guide
Top N Visitors for each Month
Window Function Example
Total for Each
Visitor by Month
Top 1 :
Time_rank = 1
Top 2 :
Time_rank <= 2
Top N :
Time_rank <= N
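The query behind this slide is not part of the extracted text, so the following is only a Python sketch of the idea: rank rows within each month, then keep rank <= N. It assumes RANK() semantics (ties share a rank) and, purely for illustration, ranks by Orders_Amount using the rows of the Website Visitor Order Table from this deck; the slide itself ranks by a measure it calls Time_rank.

```python
from collections import defaultdict

# (visitor_id, order_month, orders_amount), rows from the Website Visitor Order Table.
ORDERS = [
    (1, "January-2014", 5000), (2, "January-2014", 1000),
    (3, "January-2014", 3040), (4, "January-2014", 2000),
    (5, "January-2014", 2770), (6, "January-2014", 2750),
    (7, "January-2014", 2550), (8, "January-2014", 300),
    (1, "February-2014", 1410), (2, "February-2014", 293),
    (10, "February-2014", 304), (12, "February-2014", 314),
]


def top_n_per_month(rows, n):
    """Emulate RANK() OVER (PARTITION BY month ORDER BY amount DESC) <= n."""
    by_month = defaultdict(list)
    for visitor_id, month, amount in rows:
        by_month[month].append((visitor_id, amount))

    result = {}
    for month, entries in by_month.items():
        ordered = sorted(entries, key=lambda entry: entry[1], reverse=True)
        ranked = []
        for position, (visitor_id, amount) in enumerate(ordered, start=1):
            # Ties keep the rank of the first row with the same amount.
            if ranked and ranked[-1][2] == amount:
                rank = ranked[-1][0]
            else:
                rank = position
            ranked.append((rank, visitor_id, amount))
        result[month] = [row for row in ranked if row[0] <= n]
    return result


for month, rows in top_n_per_month(ORDERS, 2).items():
    print(month, rows)
```

With a window function the same result comes from a single pass over the table, which is the point of the "without running complex sub-queries" bullet.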
Complex Window Function Example
Website Visitor Order Table
Outlier Limits
Quartile_1 = 1750
Quartile_3 = 2837.5
Median = 2650
Max_Val = 5000, Min_Val = 300
Inter Quartile Range = Q3 - Q1 = 1087.5
Higher Control Limit = MIN(M + IQR*1.5, Max_Val) = 4281.25
Lower Control Limit = MAX(M - IQR*1.5, Min_Val) = 1018.75
Visitor_Id Order_Month Orders_Amount
1 January-2014 5,000
2 January-2014 1,000
3 January-2014 3,040
4 January-2014 2,000
5 January-2014 2,770
6 January-2014 2,750
7 January-2014 2,550
8 January-2014 300
1 February-2014 1,410
2 February-2014 293
10 February-2014 304
12 February-2014 314
*Discard Outlier visitors by spending for each month
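The January figures on this slide can be reproduced with a short Python sketch. It assumes quartiles are computed by linear interpolation, as SQL PERCENTILE_CONT does, and takes the control limits as the median plus or minus 1.5 x IQR, clamped to the observed maximum and minimum; that reading yields the 4281.25 and 1018.75 printed on the slide. Function names are illustrative.

```python
# January-2014 rows of the Website Visitor Order Table: visitor_id -> orders_amount.
JANUARY_ORDERS = {1: 5000, 2: 1000, 3: 3040, 4: 2000, 5: 2770, 6: 2750, 7: 2550, 8: 300}


def percentile_cont(values, fraction):
    """Percentile by linear interpolation between the two nearest ordered values."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower_index = int(position)
    upper_index = min(lower_index + 1, len(ordered) - 1)
    weight = position - lower_index
    return ordered[lower_index] + weight * (ordered[upper_index] - ordered[lower_index])


def control_limits(values):
    """Return (lower, higher) limits: median -/+ 1.5 * IQR, clamped to min/max."""
    q1 = percentile_cont(values, 0.25)
    q3 = percentile_cont(values, 0.75)
    median = percentile_cont(values, 0.5)
    iqr = q3 - q1
    lower = max(median - 1.5 * iqr, min(values))
    higher = min(median + 1.5 * iqr, max(values))
    return lower, higher


amounts = list(JANUARY_ORDERS.values())
lower, higher = control_limits(amounts)
kept = sorted(v for v, amount in JANUARY_ORDERS.items() if lower <= amount <= higher)

print(lower, higher)  # 1018.75 4281.25
print(kept)           # [3, 4, 5, 6, 7]
```

Visitors 1, 2 and 8 fall outside the limits and would be discarded for January.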
Tuning Commands
mysql> select count(*) from part;
+-----------+
| count(*) |
+-----------+
| 200000000 |
+-----------+
1 row in set (0.48 sec)
mysql> select calgetstats();
+-------------------------------------------------------------------------------+
| calgetstats()
+-------------------------------------------------------------------------------+
| Query Stats: MaxMemPct-0; NumTempFiles-0; TempFileSpace-0MB; PhyI/O-0; CacheI/O-98039; BlocksTouched-97658; CasPartBlks-0; MsgBytesIn-2MB; MsgBytesOut-0MB | 1242146662640516 |
+-------------------------------------------------------------------------------+
Calgetstats: Information On The Last Query Executed Within A Given Session
select 'BRAZIL', d_year, lo_tax, p_size, s_region, count(*)
from dateinfo, part, supplier, lineorder
where s_suppkey = lo_suppkey
and d_datekey = lo_orderdate
and p_partkey = lo_partkey
and lo_orderdate between 19980101 and 19981231
and s_nation = 'BRAZIL'
and p_size <> 23
group by 1,2,3,4,5
order by 1,2,3,4,5;
mysql> select calgettrace();
Calgettrace: Detailed distributed query execution plan
Query Statistics
Users can view the query statistics by selecting rows from the
querystats table in the infinidb_querystats schema.
Example 1: List execution time and rows returned for all the SELECT queries within the past 12 hours
select queryid, query, endtime-starttime, rows from querystats where starttime >= now() - interval 12 hour and querytype = 'SELECT';
Example 2: List the average, min and max running time of all the INSERT SELECT queries within the past 12 hours
select min(endtime-starttime), max(endtime-starttime), avg(endtime-starttime) from querystats where querytype='INSERT SELECT' and starttime >= now() - interval 12 hour;
calpont> getActiveSQLStatements
getactivesqlstatements Wed Oct 7 08:38:32 2015
Get List of Active SQL Statements
=================================
Start Time Time (hh:mm:ss) Session ID SQL Statement
---------------- ---------------- -------------------- --------------------------------------------------
Oct 7 08:38:30 00:00:03 73 select c_name, sum(lo_revenue) from customer,
lineorder where lo_custkey = c_custkey and c_custkey = 6 group by c_name
getActiveSQLStatements: List Active SQL Statements within the System
mysql> show processlist;
+----+------+-----------+-------+---------+------+-------+--------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+-------+---------+------+-------+--------------+
| 73 | root | localhost | ssb10 | Query | 0 | NULL | show processlist
+----+------+-----------+-------+---------+------+-------+--------------+
1 row in set (0.01 se
Tuning Commands
ETL
Bulk Data Load
cpimport, LOAD DATA INFILE
Bulk Data Export
mysql client, odbc, jdbc
Integration with MariaDB
ColumnStore cpimport and sql
interface
Bulk Data Load: cpimport
• Fastest way to load data into MariaDB ColumnStore
• Load data from CSV file
cpimport dbName tblName [loadFile]
• Load data from Standard Input
mysql -e 'select * from source_table;' -N db2 | cpimport destination_db destination_tbl -s '\t'
• Load data from Binary Source file
cpimport -I1 mydb mytable sourcefile.bin
• Multiple tables can be loaded in parallel by launching multiple jobs
• Read queries continue without being blocked
• Successful cpimport is auto-committed
• In case of errors, entire load is rolled back
Bulk Data Load: cpimport mode 1
Single file Central Input: Data source at UM
cpimport -m1 mytest mytable mytable.tbl
[Diagram: Source and cpimport at the UM Node (Name Node); three PM Nodes (Data Nodes)]
Bulk Data Load: cpimport mode 2
Distributed Input: Data Source at PMs
Partitioned load file on each PM
cpimport -m2 testdb mytable /home/mydata/mytable.tbl
[Diagram: cpimport at the UM Node (Name Node); three PM Nodes (Data Nodes), each with a Source]
Bulk Data Load: cpimport mode 3
Bulk load command at one or more PM
cpimport -m3 testdb mytable /home/mydata/mytable.tbl
[Diagram: UM Node (Name Node); three PM Nodes (Data Nodes), each with a Source and its own cpimport]
Traditional way of
importing data into
any MariaDB storage
engine table
Bulk Data Load:
LOAD DATA INFILE
Up to 2 times slower
than cpimport for
large size imports
mysql> load data infile '/tmp/outfile1.txt' into table destinationTable;
Query OK, 9765625 rows affected (2 min 20.01 sec)
Records: 9765625 Deleted: 0 Skipped: 0 Warnings: 0
Either success or
error operation can
be rolled back
Bulk Data Export
Central Export
• Connect with ODBC, JDBC or mysql client to the UM
• Extract SQL query results in output file on the UM
Distributed Export
• Fastest way to do export
• Use LOCAL PM query feature
• Connect ODBC, JDBC or mysql client to each PM
• Extract SQL query results in output file on each PM
Single Table Query
Star Schema Query
Star Schema Query – Example 2
Resources
Data Warehousing
Selective column
based queries
Large number
of dimensions
High Performance
Analytics On Large
Volume Of Data
Reporting and analysis
on millions or billions
of rows
From datasets
containing millions
to trillions of rows
Terabytes to Petabytes
of datasets
Analytics Require
Complex Joins,
Windowing Functions
Technical Use Cases
Industry | Category | Use Case
Gaming | Behavior Analytics | Projecting and predicting user behavior based on past and current data
Advertising | Customer Analytics | Customer behavior data for market segmentation and predictive analytics
Advertising | Loyalty Analytics | Customer analytics focusing on a person’s commitment to a product, company, or brand
Web, E-commerce | Click Stream Analytics | Web activity analysis, software testing and market research, with analytics on data about the areas of web pages clicked while browsing [Deal News]
Marketing | Promotional Testing | Using marketing and campaign management data to identify the best criteria to be used for a particular marketing offer
Social Network | Network Analytics | Relationship analytics among network nodes
Financial | Fraud Analytics | Monitoring user financial transactions and identifying patterns of behaviour to predict and detect abnormal or fraudulent activity, to prevent damage to user and institution
Healthcare | Patient Analytics | Analyzing patient medical records to identify patterns to be used for improved medical treatment
Healthcare | Clinical Analytics | Analyzing clinical data and its impact on patients to identify patterns to be used for improved medical treatment
Telco | Network and Application Performance Analytics | Streaming data from network devices and applications, enriched with business operations data, to uncover actionable insights for network planning, operations and marketing analytics
Aviation | Flight Analytics | Proactively project parts replacement, maintenance and airplane retirement based on real-time and historically collected flight parameter data [Boeing]
Customer Use Cases
Thank you
Serge Frezefond
Cloud Solution Architect

Paketo Buildpacks : la meilleure façon de construire des images OCI? DevopsDa...
 
How Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptxHow Recreation Management Software Can Streamline Your Operations.pptx
How Recreation Management Software Can Streamline Your Operations.pptx
 
Designing for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web ServicesDesigning for Privacy in Amazon Web Services
Designing for Privacy in Amazon Web Services
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
Climate Science Flows: Enabling Petabyte-Scale Climate Analysis with the Eart...
 
GraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysisGraphAware - Transforming policing with graph-based intelligence analysis
GraphAware - Transforming policing with graph-based intelligence analysis
 
Accelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with PlatformlessAccelerate Enterprise Software Engineering with Platformless
Accelerate Enterprise Software Engineering with Platformless
 
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamOpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam
 
Vitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume MontevideoVitthal Shirke Microservices Resume Montevideo
Vitthal Shirke Microservices Resume Montevideo
 

MariaDB ColumnStore

  • 2. Agenda Session 1 Overview Architecture Columnar, Relational vs Columnar InfiniDB Columnar Data Storage Session 2 Installation Operations Session 3 Massively Parallel Query Execution ETL
  • 4. MariaDB ColumnStore • GPLv2 Open Source • Columnar, Massively Parallel MariaDB Storage Engine • Scalable, high-performance analytics platform • Built in redundancy and high availability • Runs on premise, on AWS cloud • Full SQL syntax and capabilities regardless of platform Big Data Sources Analytics Insight MariaDB ColumnStore . . . Node 1 Node 2 Node 3 Node N Local / AWS® / GlusterFS ® ELT Tools BI Tools
  • 5. MariaDB ColumnStore Architecture [Diagram: User Connections; User Module 1 ... User Module n (MariaDB Front End, Query Engine; processes SQL requests); Performance Module 1, Performance Module 2 ... Performance Module n (Distributed Processing Engine); Columnar Distributed Data Storage]
  • 6. MariaDB ColumnStore: a high-performance columnar storage engine that supports a wide variety of analytical use cases with SQL in highly scalable distributed environments
      • Faster, More Efficient Queries: parallel query processing for distributed environments
      • Easier Enterprise Analytics: single SQL interface for OLTP and analytics
      • Better Price Performance: power of SQL and freedom of open source for big data analytics
  • 7. Workload – Query Vision/Scope. Suited for reporting or analysis of millions-billions of rows from data sets containing millions-trillions of rows. [Chart: axis ticks 1, 100, 10,000, 1,000,000, 100,000,000, 10,000,000,000; data size bands 10-100GB, 100-1,000GB, 1-10TB; regions labelled OLTP/NoSQL Workloads and OLAP/Analytic/Reporting Workloads]
  • 8. MariaDB ColumnStore: MariaDB and ExeMgr functions
      MariaDB Functions
      • MariaDB Client
      • MariaDB Connectivity (JDBC, ODBC)
      • MariaDB Security
      • Initial SQL Statement Parsing
      • Initial SQL Optimization < Custom Handler Class >
      • Execute final sort and final limit
      • Display final results
      ExeMgr Functions
      • SQL Optimization
      • Distribute work for scan, filter, join, functions, expressions, group by, aggregation, etc. to all available Performance Modules to be run in parallel
      • Collect the results returned by the Performance Modules
      • Return the final results to MySQL for display
      [Diagram: User Connections; User Module 1 ... User Module n (MySQL Front End and ExeMgr; processes SQL requests); Performance Module 1, Performance Module 2 ... Performance Module n (MPP Distributed Processing Engine; executes the queries); Columnar Distributed Data Storage]
  • 9. MariaDB ColumnStore uses standard “Engine=columnstore” syntax. Uses custom scalable columnar architecture.
      mysql> use tpcds_djoshi
      Database changed
      mysql> select count(*) from store_sales;
      +----------+
      | count(*) |
      +----------+
      |  2880404 |
      +----------+
      1 row in set (1.68 sec)
      mysql> describe warehouse;
      +-------------------+--------------+------+-----+---------+-------+
      | Field             | Type         | Null | Key | Default | Extra |
      +-------------------+--------------+------+-----+---------+-------+
      | w_warehouse_sk    | int(11)      | NO   |     | NULL    |       |
      | w_warehouse_id    | char(16)     | NO   |     | NULL    |       |
      | w_warehouse_name  | varchar(20)  | YES  |     | NULL    |       |
      | w_warehouse_sq_ft | int(11)      | YES  |     | NULL    |       |
      | w_street_number   | char(10)     | YES  |     | NULL    |       |
      | w_street_name     | varchar(60)  | YES  |     | NULL    |       |
      | w_street_type     | char(15)     | YES  |     | NULL    |       |
      | w_suite_number    | char(10)     | YES  |     | NULL    |       |
      | w_city            | varchar(60)  | YES  |     | NULL    |       |
      | w_county          | varchar(30)  | YES  |     | NULL    |       |
      | w_state           | char(2)      | YES  |     | NULL    |       |
      | w_zip             | char(10)     | YES  |     | NULL    |       |
      | w_country         | varchar(20)  | YES  |     | NULL    |       |
      | w_gmt_offset      | decimal(5,2) | YES  |     | NULL    |       |
      +-------------------+--------------+------+-----+---------+-------+
      14 rows in set (0.05 sec)
      CREATE TABLE `game_warehouse`.`dim_title` (
        `id` INT,
        `name` VARCHAR(45),
        `publisher` VARCHAR(45),
        `release_date` DATE,
        `language` INT,
        `platform_name` VARCHAR(45),
        `version` VARCHAR(45)
      ) ENGINE=columnstore;
  • 10. MariaDB ColumnStore: MariaDB Front End, Standard ANSI SQL (shows the same mysql session as slide 9: use tpcds_djoshi, select count(*) from store_sales, describe warehouse)
  • 11. User Module at a Glance (Process: Functionality / Value)
      • MariaDB. Functionality: hosts MariaDB; connection management; SQL parsing & optimization. Value: familiar DBMS interface; leverages existing partner integrations; delivers rich SQL syntax support.
      • Extent Map. Functionality: abstracts physical and logical storage; metadata store. Value: enables partition elimination.
      • ExeMgr. Functionality: work distribution; final results management and aggregation. Value: multi-threaded to take advantage of multi-core HW platforms.
  • 12. Performance Module at a Glance (Process: Functionality / Value)
      • PrimProc. Functionality: scale-out cache management; distributed scan, filter, join and aggregation operations; resource management. Value: independent scalability and tunable performance; multi-threaded to take advantage of multi-core HW platforms.
      • Data. Functionality: high speed bulk load; transactional DML and DDL; online schema extensions. Value: non-blocking read enabled; multi-threaded to take advantage of multi-core HW platforms.
  • 13. Columnar General Best Practices
      • Not suited for OLTP
      • Micro-batch load allows for near real-time behavior
      • Infrequently used columns do not impact other queries
      • Columnar suitable for sparse columns (nulls compress nicely)
  • 14. Data Modeling Best Practices
      • Star-schema optimizations are generally a good idea
      • Conservative data typing is very important
        - Especially around the fixed-length vs. dictionary boundary (8 bytes)
        - IP Address vs. IP Number
      • Break down compound fields into individual fields:
        - Trivializes searching for sub-fields
        - Can avoid dictionary overhead
        - Cost to re-assemble is generally small
  • 15. Data Storage Layer with Compression [Diagram: Logical Layer: Blocks (8KB), Extent 1 (8MB~64MB, 8 million rows); Physical Layer: Segment File 1 (maps to an Extent), Compression Chunks]
  • 16. Data Load and Extents (local load)
      • 1st Data Load: CSV file, 16 million rows, data range 1 ~ 200. Creates Extent 1 (Min 1, Max 200, 8 million rows) and Extent 2 (Min 1, Max 200, 8 million rows).
      • 2nd Data Load: new CSV file, rows 16 million +8, data range 150 ~ 210. Creates Extent 3 (Min 150, Max 210), Extent 4 (Min 150, Max 210, 8 million rows) and Extent 5 (Min 150, Max 210, 8 million rows).
  • 17. Extent Map: the key meta-structure that powers MariaDB ColumnStore’s performance
      • A catalog of all extents:
        - Minimum and maximum values for a column’s data within an extent
        - Corresponding blocks for each extent
      • Master copy of the Extent Map is kept on the primary PM node
      • Upon system startup, it is copied to all other UM and PM nodes for disaster recovery and failover purposes
      • Extent Map is resident in memory for quick access at all nodes
      • As extents are modified, updates are broadcast to all participating nodes
      • Stores about 64 bytes for each 8-64 Mbytes on disk
  • 18. Extent Map: when performing queries
      • Eliminate extents by considering only the extents for the columns used in join and filter conditions
      • Use the minimum and maximum values of the extents for join and filter columns to eliminate extents that cannot contain matching rows
      • Multiple columns can be used together for partition elimination
      • Transitive properties apply, i.e. a filter on a dimension column (date, for example) can allow for partition elimination on the fact table
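The extent elimination described on the Extent Map slides can be illustrated with a small Python sketch. This is only a toy model under stated assumptions, not ColumnStore's implementation: the `Extent` class and the `extents_to_scan` function are hypothetical names, and the min/max values are the ones shown on the Data Load and Extents slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Hypothetical in-memory extent map entry for one column."""
    extent_id: int
    min_value: int
    max_value: int


def extents_to_scan(extent_map, low, high):
    """Keep only extents whose [min, max] range overlaps the filter range [low, high]."""
    return [e for e in extent_map if e.max_value >= low and e.min_value <= high]


# Min/max values taken from the two data loads on the Data Load and Extents slide.
extent_map = [
    Extent(1, 1, 200),
    Extent(2, 1, 200),
    Extent(3, 150, 210),
    Extent(4, 150, 210),
    Extent(5, 150, 210),
]

# A filter such as "col BETWEEN 201 AND 210" can skip extents 1 and 2 entirely.
survivors = extents_to_scan(extent_map, 201, 210)
print([e.extent_id for e in survivors])
```

A filter on values 1 to 100 would keep only extents 1 and 2, while a filter on 160 to 170 overlaps every extent and eliminates nothing, which is why loading data in roughly sorted order tends to make elimination more effective.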
  • 19. Data Types. At the physical layer, all columns are stored as:
      • 1-byte field with 8192 values per 8k block
      • 2-byte field with 4096 values per 8k block
      • 4-byte field with 2048 values per 8k block
      • 8-byte field with 1024 values per 8k block
      • Dictionary structure made up of 2 files/extents with:
        - 8-byte fixed length token (pointer)
        - A variable length value stored at the location identified by the pointer
  • 20. Data Types. At the physical layer, all columns are stored as:
      • 1-byte field examples: TinyInt, Char(1)
      • 2-byte field examples: SmallInt, Char(2)
      • 4-byte field examples: Int, Char(3), Char(4), date, float
      • 8-byte field examples: BigInt, Char(5-8), datetime, real/double
      • Dictionary examples: Varchar(8) or larger, Char(9) or larger
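The values-per-block figures on the Data Types slides follow directly from dividing an 8k block by the field width. The Python sketch below works through that arithmetic and the CHAR(n) width boundaries from the examples. The function names are hypothetical, and reading "8 million rows" as 8 × 1024 × 1024 is an assumption, chosen because it makes the extent sizes line up with the 8MB~64MB range quoted on the storage slide.

```python
BLOCK_BYTES = 8192                   # "8k block" from the slides
ROWS_PER_EXTENT = 8 * 1024 * 1024    # assumption: "8 million rows" read as 8 * 2**20


def values_per_block(width_bytes):
    """Number of fixed-width values that fit in one 8k block."""
    return BLOCK_BYTES // width_bytes


def extent_size_mib(width_bytes):
    """Uncompressed size of one extent of a fixed-width column, in MiB."""
    return ROWS_PER_EXTENT * width_bytes // (1024 * 1024)


def char_storage_width(n):
    """Field width in bytes for CHAR(n) per the slide examples; None means dictionary."""
    if n <= 2:
        return n
    if n <= 4:
        return 4
    if n <= 8:
        return 8
    return None  # Char(9) or larger goes to the dictionary structure


for width in (1, 2, 4, 8):
    print(width, values_per_block(width), extent_size_mib(width))
```

The dictionary boundary is why the Data Modeling slide stresses conservative typing: under these rules a CHAR(8) column stays a fixed 8-byte field, while CHAR(9) adds an 8-byte token plus a separately stored value.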
  • 21. Sizing (DETAILED SIZING GUIDE based on data size and workload)
      • Minimum spec: UM 4 core, 32G RAM; PM 4 core, 16G RAM
      • Typical server spec: PM 8 core, 64G RAM; UM 8 core, 264G RAM
      • Data storage, external data volumes: maximum 2 data volumes per IO channel per PM node server; up to 2TB on disk per data volume ≈ max 4TB per PM node
      • Data storage, local disk: up to 2TB on disk per PM node server
  • 22. Sizing - Example • MariaDB ColumnStore 60TB uncompressed data = 6TB compressed data at 10x compression • 2UM - 8 core 512G(based on work load) • 6 TB compressed = 3 data volume (at 2TB per volume) - with 1 data volume per PM node - 3PMs • Data growth - 2TB per month, Data retention - 2 years - Plan for 2TB X24 = 48 TB additional - 48 TB = 4.8TB compressed ≈ 3 data volume(at 2TB per volume) with 1 data volume per PM node - 3 additional PMs • Total 6 PMs, 2 UMs
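The arithmetic in the sizing example can be reproduced with a few lines of Python. This is a rough sketch of the slide's reasoning only, not an official sizing tool: the function name `pms_needed` is hypothetical, and the 10x compression ratio, 2TB per data volume and one volume per PM node are the figures used on the slide.

```python
import math


def pms_needed(uncompressed_tb, compression_ratio=10, volume_tb=2, volumes_per_pm=1):
    """Estimate PM nodes from uncompressed data size, following the slide's steps."""
    compressed_tb = uncompressed_tb / compression_ratio
    volumes = math.ceil(compressed_tb / volume_tb)
    return math.ceil(volumes / volumes_per_pm)


initial_pms = pms_needed(60)          # 60TB -> 6TB compressed -> 3 volumes -> 3 PMs
growth_tb = 2 * 24                    # 2TB per month over a 2-year retention
growth_pms = pms_needed(growth_tb)    # 48TB -> 4.8TB compressed -> 3 volumes -> 3 PMs
print(initial_pms, growth_pms, initial_pms + growth_pms)
```

Rounding up at the volume level is what turns 4.8TB of compressed growth into three 2TB volumes, matching the slide's total of 6 PMs.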
  • 24. SQL Features Source : InfiniDB SQL Syntax Guide Cross Engine Joins UDF DML Aggregation DDL Disk Based Joins Windowing Functions SELECT QUERY
  • 25. Windowing Functions (Source: InfiniDB SQL Syntax Guide)
      • Aggregate over a series of related rows
      • Simplified functions for complex statistical analytics over a sliding window per row
        - Cumulative, moving or centered aggregates
        - Simple statistical functions like rank, max, min, average, median
        - More complex functions such as distribution, percentile, lag, lead
        - Without running complex sub-queries
      • Supported functions: MAX, MIN, COUNT, SUM, AVG, VARIANCE, VAR_POP, VAR_SAMP, STD, STDDEV, STDDEV_POP, STDDEV_SAMP, ROW_NUMBER, RANK, DENSE_RANK, PERCENT_RANK, NTH_VALUE, FIRST_VALUE, LAST_VALUE, CUME_DIST, LAG, LEAD, NTILE, PERCENTILE_CONT, PERCENTILE_DISC, MEDIAN
  • 26. Top N Visitors for each Month Window Function Example Total for Each Visitor by Month Top 1 : Time_rank = 1 Top 2 : Time_rank <= 2 Top N : Time_rank <= N
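The Top N pattern on this slide (total per visitor per month, rank within each month, keep rows where the rank is at most N) can be mimicked in plain Python to show what the window function computes. This is an illustrative sketch only: the function name `top_n_per_month` and the sample rows are made up, and ties are ranked the way SQL `RANK()` ranks them (equal totals share a rank and the next rank is skipped).

```python
from collections import defaultdict


def top_n_per_month(rows, n):
    """rows: (visitor, month, time_spent). Returns {month: [(visitor, total, rank), ...]}."""
    # Step 1: total for each visitor by month (the GROUP BY part).
    totals = defaultdict(float)
    for visitor, month, time_spent in rows:
        totals[(month, visitor)] += time_spent

    # Step 2: partition the totals by month.
    by_month = defaultdict(list)
    for (month, visitor), total in totals.items():
        by_month[month].append((visitor, total))

    # Step 3: rank within each month by total descending, keep rank <= n.
    result = {}
    for month, entries in by_month.items():
        entries.sort(key=lambda e: e[1], reverse=True)
        kept, prev_total, rank = [], None, 0
        for position, (visitor, total) in enumerate(entries, start=1):
            if total != prev_total:      # a new total starts a new rank
                rank, prev_total = position, total
            if rank <= n:
                kept.append((visitor, total, rank))
        result[month] = kept
    return result


# Hypothetical sample data, not taken from the slides.
sample = [
    ("a", "2014-01", 10), ("a", "2014-01", 5), ("b", "2014-01", 20),
    ("c", "2014-01", 15), ("a", "2014-02", 7), ("b", "2014-02", 9),
]
print(top_n_per_month(sample, 1))
```

With N = 1 only visitor "b" survives for each month; with N = 2 the January result also keeps "a" and "c", which tie on total time and therefore share rank 2.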
  • 27. Complex Window Function Example: discard outlier visitors by spending for each month
      Website Visitor Order Table
      Visitor_Id  Order_Month    Orders_Amount
      1           January-2014   5,000
      2           January-2014   1,000
      3           January-2014   3,040
      4           January-2014   2,000
      5           January-2014   2,770
      6           January-2014   2,750
      7           January-2014   2,550
      8           January-2014   300
      1           February-2014  1,410
      2           February-2014  293
      10          February-2014  304
      12          February-2014  314
      Outlier Limits (January-2014)
      • Quartile_1 = 1750, Quartile_3 = 2837.5, Median (M) = 2650
      • Max_Val = 5000, Min_Val = 300
      • Inter Quartile Range (IQR) = Q3 - Q1 = 1087.5
      • Higher Control Limit = MIN(M + IQR*1.5, Max_Val) = 4281.25
      • Lower Control Limit = MAX(M - IQR*1.5, Min_Val) = 1018.75
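The outlier limits for January-2014 can be checked with a short Python sketch. It reproduces the figures on the slide (4281.25 and 1018.75) by taking the higher limit as M + 1.5 × IQR and the lower limit as M - 1.5 × IQR, capped by the maximum and minimum values. The helper `percentile_cont` is a hypothetical stand-in that uses linear interpolation between neighbouring values, which is assumed here to match what the SQL `PERCENTILE_CONT` function computes.

```python
import math


def percentile_cont(values, p):
    """Linearly interpolated percentile (0 <= p <= 1) over a list of numbers."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


# January-2014 order amounts from the Website Visitor Order Table.
january = [5000, 1000, 3040, 2000, 2770, 2750, 2550, 300]

q1 = percentile_cont(january, 0.25)
median = percentile_cont(january, 0.50)
q3 = percentile_cont(january, 0.75)
iqr = q3 - q1

higher_limit = min(median + 1.5 * iqr, max(january))
lower_limit = max(median - 1.5 * iqr, min(january))

# Keep only the orders inside the control limits.
kept = [amount for amount in january if lower_limit <= amount <= higher_limit]
print(q1, median, q3, iqr, higher_limit, lower_limit, kept)
```

Under these limits the orders of 5,000, 1,000 and 300 fall outside the range and are discarded, leaving five January orders.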
  • 28. Tuning Commands. calgetstats: information on the last query executed within a given session
      mysql> select count(*) from part;
      +-----------+
      | count(*)  |
      +-----------+
      | 200000000 |
      +-----------+
      1 row in set (0.48 sec)
      mysql> select calgetstats();
      | calgetstats() |
      | Query Stats: MaxMemPct-0; NumTempFiles-0; TempFileSpace-0MB; PhyI/O-0; CacheI/O-98039; BlocksTouched-97658; CasPartBlks-0; MsgBytesIn-2MB; MsgBytesOut-0MB |
  • 29. Tuning Commands. calgettrace: detailed distributed query execution plan
      select 'BRAZIL', d_year, lo_tax, p_size, s_region, count(*)
      from dateinfo, part, supplier, lineorder
      where s_suppkey = lo_suppkey
        and d_datekey = lo_orderdate
        and p_partkey = lo_partkey
        and lo_orderdate between 19980101 and 19981231
        and s_nation = 'BRAZIL'
        and p_size <> 23
      group by 1,2,3,4,5
      order by 1,2,3,4,5;
      mysql> select calgettrace();
  • 30. Tuning Commands. Query Statistics: users can view the query statistics by selecting rows from the querystats table in the infinidb_querystats schema.
      Example 1: list execution time and rows returned for all the select queries within the past 12 hours
      select queryid, query, endtime-starttime, rows from querystats where starttime >= now() - interval 12 hour and querytype = 'SELECT';
      Example 2: list the average, min and max running time of all the INSERT SELECT queries within the past 12 hours
      select min(endtime-starttime), max(endtime-starttime), avg(endtime-starttime) from querystats where querytype='INSERT SELECT' and starttime >= now() - interval 12 hour;
  • 31. Tuning Commands. getActiveSQLStatements: list active SQL statements within the system
      calpont> getActiveSQLStatements
      getactivesqlstatements Wed Oct 7 08:38:32 2015
      Get List of Active SQL Statements
      =================================
      Start Time        Time (hh:mm:ss)   Session ID   SQL Statement
      ----------------  ----------------  -----------  --------------------------------------------------
      Oct 7 08:38:30    00:00:03          73           select c_name, sum(lo_revenue) from customer, lineorder where lo_custkey = c_custkey and c_custkey = 6 group by c_name
      mysql> show processlist;
      +----+------+-----------+-------+---------+------+-------+------------------+
      | Id | User | Host      | db    | Command | Time | State | Info             |
      +----+------+-----------+-------+---------+------+-------+------------------+
      | 73 | root | localhost | ssb10 | Query   | 0    | NULL  | show processlist |
      +----+------+-----------+-------+---------+------+-------+------------------+
      1 row in set (0.01 se
  • 33. Integration with MariaDB ColumnStore: cpimport and sql interface. Bulk Data Load: cpimport, LOAD DATA INFILE. Bulk Data Export: mysql client, odbc, jdbc.
  • 34. Bulk Data Load: cpimport
      • Fastest way to load data into MariaDB ColumnStore
      • Load data from CSV file: cpimport dbName tblName [loadFile]
      • Load data from Standard Input: mysql -e 'select * from source_table;' -N db2 | cpimport destination_db destination_tbl -s '\t'
      • Load data from Binary Source file: cpimport -I1 mydb mytable sourcefile.bin
      • Multiple tables can be loaded in parallel by launching multiple jobs
      • Read queries continue without being blocked
      • Successful cpimport is auto-committed
      • In case of errors, entire load is rolled back
  • 35. Bulk Data Load: cpimport mode 1. Central Input, single file: data source at UM. cpimport -m1 mytest mytable mytable.tbl [Diagram: cpimport and the source file on the UM node (name node); three PM nodes (data nodes)]
  • 36. Bulk Data Load: cpimport mode 2. Distributed Input: data source at PMs, partitioned load file on each PM. cpimport -m2 testdb mytable /home/mydata/mytable.tbl [Diagram: cpimport on the UM node (name node); a source file on each of the three PM nodes (data nodes)]
  • 37. Bulk Data Load: cpimport mode 3. Distributed Input: data source at PMs, partitioned load file on each PM (as in mode 2: cpimport -m2 testdb mytable /home/mydata/mytable.tbl). Bulk load command at one or more PM: cpimport -m3 testdb mytable /home/mydata/mytable.tbl [Diagram: UM node (name node); cpimport and a source file on each of the three PM nodes (data nodes)]
  • 38. Bulk Data Load: LOAD DATA INFILE
      • Traditional way of importing data into any MariaDB storage engine table
      • Up to 2 times slower than cpimport for large imports
      • Whether it succeeds or fails, the operation can be rolled back
      mysql> load data infile '/tmp/outfile1.txt' into table destinationTable;
      Query OK, 9765625 rows affected (2 min 20.01 sec)
      Records: 9765625 Deleted: 0 Skipped: 0 Warnings: 0
  • 39. Bulk Data Export
      Central Export
      • Connect with ODBC, JDBC or mysql client to the UM
      • Extract SQL query results in output file on the UM
      Distributed Export
      • Fastest way to do export
      • Use LOCAL PM query feature
      • Connect ODBC, JDBC or mysql client to each PM
      • Extract SQL query results in output file on each PM
  • 42. Star Schema Query – Example 2
  • 44. Technical Use Cases
      • Data Warehousing: selective column based queries; large number of dimensions
      • High Performance Analytics on Large Volume of Data: reporting and analysis on millions or billions of rows, from datasets containing millions to trillions of rows; terabytes to petabytes of datasets
      • Analytics Require Complex Joins, Windowing Functions
  • 45. Customer Use Cases (Industry / Category: Use Case)
      • Gaming / Behavior Analytics: projecting and predicting user behavior based on past and current data
      • Advertising / Customer Analytics: customer behavior data for market segmentation and predictive analytics
      • Advertising / Loyalty Analytics: customer analytics focusing on a person’s commitment to a product, company, or brand
      • Web, E-commerce / Click Stream Analytics: web activity analysis, software testing, market research with analytics on data about the clicked areas of web pages while web browsing [Deal News]
      • Marketing / Promotional Testing: using marketing and campaign management data to identify the best criteria to be used for a particular marketing offer
      • Social Network / Network Analytics: relationship analytics among network nodes
      • Financial / Fraud Analytics: monitoring user financial transactions and identifying patterns of behaviour to predict and detect abnormal or fraudulent activity to prevent damage to user and institution
      • Healthcare / Patient Analytics: analyzing patient medical records to identify patterns to be used for improved medical treatment
      • Healthcare / Clinical Analytics: analyzing clinical data and its impact on patients to identify patterns to be used for improved medical treatment
      • Telco / Network and Application Performance Analytics: streaming data from network devices and applications enriched with business operations data to uncover actionable insights for network planning, operations and marketing analytics
      • Aviation / Flight Analytics: proactively project parts replacement, maintenance and airplane retirement based on real-time and historically collected flight parameter data [Boeing]
  • 46. Thank you Serge Frezefond Cloud Solution Architect