M|18 Analytics in the Real World, Case Studies and Use Cases

Amy Krishnamohan
Director of Product Marketing
GOTO Satoru
Customer Solutions Engineer

Finance
● Identify trade patterns
● Detect fraud and anomalies
● Predict trading outcomes
Manufacturing
● Simulations to improve design/yield
● Detect production anomalies
● Predict machine failures (sensor data)
Telecom
● Behavioral analysis of customer calls
● Network analysis (perf and reliability)
Healthcare
● Find genetic profiles/matches
● Analyze health vs spending
● Predict viral outbreaks
CIM Inc.
MariaDB AX Use Case

Life Science industry
Industry: biotechnology (genetics)
Data: genotypes
Use Case: genetic profiling
Details:
1. Find genetic mates for cattle
2. Predict meat production
3. Gene/DNA analysis
Had to convert to CSV files and schedule import jobs (cron)
Always receiving new genetic data
Migrated to data adapter (Python)
● streamline import process
● remove steps / possible errors
● remove delays
● import data on demand
● immediate customer access

Healthcare industry
Industry: healthcare (Medicaid)
Data: surveys
Use Case: decision support system
Details:
1. Identify trends and patterns
2. Determine population cohorts
3. Predict health outcomes
4. Anticipate funding / capacity
5. Recommend intervention
Can’t do complex queries on current hardware with Oracle and snowflake schemas
Limited to optimizing for simple, known queries (2-3 columns)
Replaced with ColumnStore
● a single table
● 2.5 million rows, 248 columns > complex, ad-hoc queries
● query 20+ columns in seconds

Advertisement industry
Industry: Digital Advertisement
Data: Log
Use Case: Ad Analytics
Details:
1. Import log
2. Analyze customer behavior
a. Website click
b. Keyword search
3. Optimize ad performance
4. Manage dynamic pricing based on the KPI
Needs real-time analytics to optimize advertisement
● fast data ingestion
● optimizes ad performance
● A/B testing
● targets ads by geography and demographic
● provides automated monitoring
● adjusts traffic based on real-time performance
● manages dynamic pricing

High tech industry
Industry: High tech
Data: Asset tracking time series data
Use Case: Asset Management
Details:
1. Collect asset tracking data
2. Analyze and monitor
a. Contract
b. Performance
3. Proactive service
Needs to ingest text type data and integrate with a BI tool
● faster data ingestion
● time series analysis with window functions
● real-time asset monitoring with Tableau
● predictive asset maintenance

Manufacturing industry
Industry: Manufacturing / Automobile
Data: Sensor data
Use Case: Predictive Maintenance
Details:
1. Receive sensor data from different parts
2. Real-time monitoring
3. Analyze historical data to uncover machine failure patterns
4. Predict machine failure
5. Schedule proactive maintenance
Needs real-time data ingestion
Needs integration with Spark to run machine learning algorithms
● faster data ingestion
● leverage Spark ML
● real-time monitoring
● reduce production downtime

Finance industry
Industry: Finance
Data: Trading records
Use Case: Trading analysis
Details:
1. Collect asset tracking data
2. Analyze and monitor
a. Contract
b. Performance
3. Proactive service
Needs big data analytics solution to analyze over 25 million quote records and 100,000 trading records per day
● archive large sets of data to comply with regulations
● provide self-service analytics to sales/marketing team
● time series analysis with window functions

Time Series Data Analysis
with ColumnStore

Free currency historical data from HistData.com
•GBPUSD M1 (1 minute) historical data in 2016
http://www.histdata.com/download-free-forex-historical-data/?/ascii/1-minute-bar-quotes/gbpusd/2016
•download HISTDATA_COM_ASCII_GBPUSD_M1_2016.zip

Free GBPUSD historical data (2016)
•1st column: timestamp
•need to convert the format in order to fit with DATETIME data type

MariaDB ColumnStore Data Types
• INT types - range is 2 values smaller than the standard range (at the signed minimum or the unsigned maximum)
• CHAR - max 255 bytes
• VARCHAR - max 8000 bytes
• DECIMAL - max 18 digits
• DOUBLE/FLOAT
• DATETIME - no sub-seconds yyyy-mm-dd hh:mm:ss
• DATE
• BLOB/TEXT

Convert timestamp w/ Ruby script
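id = 0
while line = gets
  # first field is the timestamp, followed by the four price fields
  timestamp, open, high, low, close = line.split(";")
  # a4a2a2 = year, month, day; x skips one character; a2a2a2 = hour, minute, second
  year, month, day, hour, minute, second =
    timestamp.unpack("a4a2a2xa2a2a2")
  id += 1
  print "#{id},#{year}-#{month}-#{day} #{hour}:#{minute},"
  puts [open, high, low, close].join(",")
end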

Converted CSV
1,2016-01-03 17:00,1.473350,1.473350,1.473290,1.473290
2,2016-01-03 17:01,1.473280,1.473360,1.473260,1.473350
3,2016-01-03 17:02,1.473350,1.473350,1.473290,1.473290
4,2016-01-03 17:03,1.473300,1.473330,1.473290,1.473320
5,2016-01-03 17:04,1.473320,1.473340,1.473320,1.473320
6,2016-01-03 17:05,1.473340,1.473370,1.473300,1.473320
7,2016-01-03 17:06,1.473320,1.473320,1.473310,1.473310
8,2016-01-03 17:07,1.473310,1.473310,1.473300,1.473310
9,2016-01-03 17:08,1.473310,1.474010,1.473300,1.474010
• DATETIME - no sub-seconds yyyy-mm-dd hh:mm:ss
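For comparison, the same conversion can be sketched in Python. This is only an illustrative equivalent of the Ruby script, not part of the original workflow. It assumes each HistData line looks like `YYYYMMDD HHMMSS;open;high;low;close;volume` (inferred from the `unpack` format), and the function name `convert_line` is hypothetical. Unlike the Ruby version it also writes the seconds, so the value has the full `yyyy-mm-dd hh:mm:ss` shape.

```python
from datetime import datetime


def convert_line(row_id, line):
    """Convert one HistData M1 line into a CSV row (id, time, open, high, low, close)."""
    fields = line.strip().split(";")
    timestamp, open_, high, low, close = fields[:5]
    # Assumed input layout: "YYYYMMDD HHMMSS"
    parsed = datetime.strptime(timestamp, "%Y%m%d %H%M%S")
    return ",".join([str(row_id), parsed.strftime("%Y-%m-%d %H:%M:%S"),
                     open_, high, low, close])


# Sample line built from the first converted row shown above (volume field assumed)
sample = "20160103 170000;1.473350;1.473350;1.473290;1.473290;0"
print(convert_line(1, sample))
```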

CREATE DATABASE/TABLE
MariaDB [(none)]> create database forex;
MariaDB [(none)]> use forex;
MariaDB [forex]> CREATE TABLE gbpusd(
id int,
time datetime,
open double,
high double,
low double,
close double)
engine=columnstore default character set=utf8;

import CSV into ColumnStore using cpimport
# cpimport -s ',' forex gbpusd gbpusd2016.csv
Locale is : C
Column delimiter : ,
Using table OID 3163 as the default JOB ID
Input file(s) will be read from : /home/vagrant/histdata
Job description file :
/usr/local/mariadb/columnstore/data/bulk/tmpjob/3163_D20170624_T103843_S950145_Job_3163.xml
Log file for this job: /usr/local/mariadb/columnstore/data/bulk/log/Job_3163.log
2017-06-24 10:38:43 (29756) INFO : successfully loaded job file
2017-06-24 10:38:43 (29756) INFO : Job file loaded, run time for this step : 0.0321331 seconds
2017-06-24 10:38:43 (29756) INFO : PreProcessing check starts
2017-06-24 10:38:43 (29756) INFO : input data file /home/vagrant/histdata/gbpusd2016.csv
2017-06-24 10:38:43 (29756) INFO : PreProcessing check completed
2017-06-24 10:38:43 (29756) INFO : preProcess completed, run time for this step : 0.0329528 seconds
2017-06-24 10:38:43 (29756) INFO : No of Read Threads Spawned = 1
2017-06-24 10:38:43 (29756) INFO : No of Parse Threads Spawned = 3
2017-06-24 10:38:45 (29756) INFO : For table forex.gbpusd: 372480 rows processed and 372480 rows inserted.
2017-06-24 10:38:46 (29756) INFO : Bulk load completed, total run time : 2.11976 seconds
cpimport arguments: DB (forex), table (gbpusd), input file (gbpusd2016.csv)

if cpimport failed...
# cpimport forex gbpusd gbpusd2016.csv
Locale is : C
Using table OID 3163 as the default JOB ID
Input file(s) will be read from : /home/vagrant/histdata
Job description file : /usr/local/mariadb/columnstore/data/bulk/tmpjob/3163_D20170624_T104034_S269473_Job_3163.xml
Log file for this job: /usr/local/mariadb/columnstore/data/bulk/log/Job_3163.log
2017-06-24 10:40:34 (30209) INFO : successfully loaded job file
2017-06-24 10:40:34 (30209) INFO : Job file loaded, run time for this step : 0.0253589 seconds
2017-06-24 10:40:34 (30209) INFO : PreProcessing check starts
2017-06-24 10:40:34 (30209) INFO : input data file /home/vagrant/histdata/gbpusd2016.csv
2017-06-24 10:40:34 (30209) INFO : PreProcessing check completed
2017-06-24 10:40:34 (30209) INFO : preProcess completed, run time for this step : 0.065531 seconds
2017-06-24 10:40:34 (30209) INFO : No of Read Threads Spawned = 1
2017-06-24 10:40:34 (30209) INFO : No of Parse Threads Spawned = 3
2017-06-24 10:40:34 (30209) INFO : Number of rows with errors = 11. Row numbers with error reasons are listed in file
/home/vagrant/histdata/gbpusd2016.csv.Job_3163_30209.err
2017-06-24 10:40:34 (30209) INFO : Number of rows with errors = 11. Exact error rows are listed in file
/home/vagrant/histdata/gbpusd2016.csv.Job_3163_30209.bad
2017-06-24 10:40:34 (30209) ERR : Actual error row count(11) exceeds the max error rows(10) allowed for table forex.gbpusd [1451]
2017-06-24 10:40:34 (30209) CRIT : Bulkload Read (thread 0) Failed for Table forex.gbpusd. Terminating this job. [1451]
2017-06-24 10:40:34 (30209) INFO : Bulkload Parse (thread 2) Stopped parsing Tables. BulkLoad::parse() responding to job termination
2017-06-24 10:40:34 (30209) INFO : Table forex.gbpusd (OID-3163) was not successfully loaded. Rolling back.
2017-06-24 10:40:34 (30209) INFO : Bulk load completed, total run time : 0.638649 seconds

verify your Job_xxxx_xxxxx.err
gbpusd2016.csv.Job_3163_30209.err :
Line number 1; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1
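The error above reports one field found where six were expected, which is what a separator mismatch looks like: the successful run passed `-s ','` and the failing run did not. A small pre-check, sketched here in Python, can catch this before a load. The helper name `count_bad_rows` is hypothetical, and the `|` in the second call is an assumption about which separator cpimport falls back to when `-s` is omitted.

```python
import csv
import io


def count_bad_rows(text, delimiter, expected_fields):
    """Count rows whose field count differs from the table's column count."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return sum(1 for row in reader if len(row) != expected_fields)


sample = (
    "1,2016-01-03 17:00,1.473350,1.473350,1.473290,1.473290\n"
    "2,2016-01-03 17:01,1.473280,1.473360,1.473260,1.473350\n"
)

# Split on ',' every row has the six fields the gbpusd table expects
print(count_bad_rows(sample, ",", 6))
# Split on '|' (assumed default) every row collapses into a single field
print(count_bad_rows(sample, "|", 6))
```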

performance LOAD DATA LOCAL INFILE
# mcsmysql --local-infile=1 forex
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 38
Server version: 10.1.23-MariaDB Columnstore 1.0.9-1
Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [forex]> LOAD DATA LOCAL INFILE 'gbpusd2016.csv' INTO TABLE gbpusd FIELDS
TERMINATED BY ',';
Query OK, 372480 rows affected (1.52 sec)
Records: 372480 Deleted: 0 Skipped: 0 Warnings: 0

Performance ColumnStore : cpimport
# cpimport -s ',' forex gbpusd gbpusd2016.csv
2017-06-24 10:38:45 (29756) INFO : For table forex.gbpusd: 372480
rows processed and 372480 rows inserted.
2017-06-24 10:38:46 (29756) INFO : Bulk load completed, total run
time : 2.11976 seconds
-s: field separator

CSV import: cpimport vs. LOAD DATA LOCAL INFILE
• cpimport: 2 sec. for 372,480 rows
• LOAD DATA LOCAL INFILE: 372,480 rows affected (1.52 sec)

Performance ColumnStore : INSERT INTO
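INSERT INTO gbpusd_idb(id, time, open, high, low, close) VALUES('1',
'2016-01-03 17:00', '1.473350', '1.473350', '1.473290', '1.473290');
INSERT INTO gbpusd_idb(id, time, open, high, low, close) VALUES('2',
'2016-01-03 17:01', '1.473280', '1.473360', '1.473260', '1.473350');
INSERT INTO gbpusd_idb(id, time, open, high, low, close) VALUES('3',
'2016-01-03 17:02', '1.473350', '1.473350', '1.473290', '1.473290');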
...
MariaDB [forex]> source gbpusd2016.sql
...
MariaDB [forex]> Bye
real 18m16.178s
user 0m28.330s
sys 0m23.551s
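To put the three load methods on one scale, the timings above can be turned into rows per second. The sketch below is plain arithmetic on the numbers shown in the slides. It assumes the INSERT script loaded the same 372,480 rows as the CSV imports, which the slides do not state explicitly.

```python
ROWS = 372_480  # row count reported by cpimport and LOAD DATA

# Elapsed times taken from the outputs above, in seconds
timings_sec = {
    "cpimport": 2.11976,
    "LOAD DATA LOCAL INFILE": 1.52,
    "INSERT INTO (sourced .sql file)": 18 * 60 + 16.178,  # real 18m16.178s
}


def rows_per_second(rows, seconds):
    """Average load throughput."""
    return rows / seconds


for method, seconds in timings_sec.items():
    print(f"{method}: {rows_per_second(ROWS, seconds):,.0f} rows/s")
```

On these figures the two bulk paths land in the hundreds of thousands of rows per second, while row-by-row INSERT stays in the hundreds.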

SQLPad - https://github.com/rickbergfalk/sqlpad

SQLPad installation / launch
# yum -y install npm (EPEL repository required)
# npm install sqlpad -g
$ sqlpad --ip 0.0.0.0 --port 3000
Launching server WITHOUT SSL
Welcome to SQLPad!. Visit http://localhost:3000 to get started

UK votes to leave EU after dramatic night divides nation
• https://www.theguardian.com/politics/2016/jun/24/britain-votes-for-brexit-eu-referendum-david-cameron
The value of the pound swung wildly on currency markets as initial confidence among investors expecting a remain vote was dented by some of the early referendum results, triggering falls of close to 10% and its biggest one-day fall ever.

simple query for time period before/after vote

MariaDB ColumnStore
Window Functions

Supported Window Functions
Function - Description
AVG() - The average of all input values.
COUNT() - Number of input rows.
CUME_DIST() - Calculates the cumulative distribution, or relative rank, of the current row to other rows in the same partition. Number of peer or preceding rows / number of rows in partition.
DENSE_RANK() - Ranks items in a group leaving no gaps in ranking sequence when there are ties.
FIRST_VALUE() - The value evaluated at the row that is the first row of the window frame (counting from 1); null if no such row.
LAG() - The value evaluated at the row that is offset rows before the current row within the partition; if there is no such row, instead return default. Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to null. LAG provides access to more than one row of a table at the same time without a self-join. Given a series of rows returned from a query and a position of the cursor, LAG provides access to a row at a given physical offset prior to that position.
LAST_VALUE() - The value evaluated at the row that is the last row of the window frame (counting from 1); null if no such row.
LEAD() - Provides access to a row at a given physical offset beyond that position. Returns value evaluated at the row that is offset rows after the current row within the partition; if there is no such row, instead return default. Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to null.
MAX() - Maximum value of expression across all input values.
MEDIAN() - An inverse distribution function that assumes a continuous distribution model. It takes a numeric or datetime value and returns the middle value or an interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation.
MIN() - Minimum value of expression across all input values.
NTH_VALUE() - The value evaluated at the row that is the nth row of the window frame (counting from 1); null if no such row.
NTILE() - Divides an ordered data set into a number of buckets indicated by expr and assigns the appropriate bucket number to each row. The buckets are numbered 1 through expr. The expr value must resolve to a positive constant for each partition. Integer ranging from 1 to the argument value, dividing the partition as equally as possible.
PERCENT_RANK() - Relative rank of the current row: (rank - 1) / (total rows - 1).
PERCENTILE_CONT() - An inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.
PERCENTILE_DISC() - An inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation.
RANK() - Rank of the current row with gaps; same as row_number of its first peer.
ROW_NUMBER() - Number of the current row within its partition, counting from 1.
STDDEV(), STDDEV_POP() - Computes the population standard deviation and returns the square root of the population variance.
STDDEV_SAMP() - Computes the cumulative sample standard deviation and returns the square root of the sample variance.
SUM() - Sum of expression across all input values.
VARIANCE(), VAR_POP() - Population variance of the input values (square of the population standard deviation).
VAR_SAMP() - Sample variance of the input values (square of the sample standard deviation).
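The difference between RANK(), DENSE_RANK() and PERCENT_RANK() is easiest to see on a small input with a tie. The following is a minimal Python sketch of the definitions in the table above, for a single partition ordered ascending. It is not how ColumnStore implements them, and the function name `rank_functions` is hypothetical.

```python
def rank_functions(values):
    """Return (value, rank, dense_rank, percent_rank) for each value, ascending."""
    ordered = sorted(values)
    total = len(ordered)
    rows = []
    rank = dense = 0
    previous = object()  # sentinel that never equals a real value
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position   # RANK: row number of the first peer, so ties leave gaps
            dense += 1        # DENSE_RANK: increases by one per distinct value
            previous = value
        # PERCENT_RANK: (rank - 1) / (total rows - 1)
        percent = (rank - 1) / (total - 1) if total > 1 else 0.0
        rows.append((value, rank, dense, percent))
    return rows


for row in rank_functions([1.47, 1.48, 1.48, 1.49]):
    print(row)
```

With the tied value in the middle, RANK() skips from 2 to 4 while DENSE_RANK() continues with 3.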

MariaDB ColumnStore
Aggregate Functions

MAX GBPUSD 23rd - 25th June 2016

MIN GBPUSD 23rd - 25th June 2016

Drop-off rate GBPUSD 23rd - 25th June 2016
-13% drop in a few hours

Correlation GBPUSD - USDJPY @ Brexit
Scatter plot (normalized): GBPUSD*100-130 vs. USDJPY-110

SELECT
( AVG( gbpusd.close * usdjpy.close ) - AVG( gbpusd.close ) * AVG( usdjpy.close ) ) /
( STDDEV(gbpusd.close) * STDDEV(usdjpy.close) )
AS correlation_coefficient_population
FROM usdjpy
INNER JOIN gbpusd ON gbpusd.time = usdjpy.time
WHERE
gbpusd.time BETWEEN TIMESTAMP ( '2016-06-22' )
AND TIMESTAMP ( '2016-06-26' );
Pearson correlation coefficient
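The query above computes the population form of the Pearson coefficient: (mean of the products minus the product of the means) divided by the product of the population standard deviations, with STDDEV() being the population standard deviation as listed in the window function table. A small Python sketch of the same formula, on made-up inputs rather than the forex data, may make the terms easier to follow; the helper names are hypothetical.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def pearson_population(x, y):
    """(AVG(x*y) - AVG(x)*AVG(y)) / (STDDEV_POP(x) * STDDEV_POP(y))."""
    covariance = mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)
    # Population standard deviation: sqrt(AVG(x^2) - AVG(x)^2)
    std_x = math.sqrt(mean([a * a for a in x]) - mean(x) ** 2)
    std_y = math.sqrt(mean([b * b for b in y]) - mean(y) ** 2)
    return covariance / (std_x * std_y)


# Illustrative inputs only: a perfectly linear and a perfectly inverse relationship
print(pearson_population([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_population([1, 2, 3, 4], [8, 6, 4, 2]))
```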

Scatter Plot (normalized): GBPUSD*100-130
correlation coeff. 94.4%: highly correlated

Correlation GBPUSD - USDJPY 2016 (Jan. - Dec.)
Scatter Plot (normalized): GBPUSD*100-130 vs. USDJPY-110
correlation coeff. 36%: low correlation

Performance - ColumnStore vs. InnoDB
ColumnStore storage engine:
+------------------------------------+
| correlation_coefficient_population |
+------------------------------------+
| 0.9648375371071727 |
+------------------------------------+
1 row in set (0.43 sec)
> 1000 times faster than InnoDB
SELECT
(AVG(gbpusd.close*usdjpy.close) - AVG(gbpusd.close)*AVG(usdjpy.close)) /
(STDDEV(gbpusd.close) * STDDEV(usdjpy.close))
AS correlation_coefficient_population
FROM gbpusd
JOIN usdjpy ON gbpusd.time = usdjpy.time
WHERE gbpusd.time BETWEEN TIMESTAMP('2016-06-23') AND TIMESTAMP('2016-06-25');
InnoDB storage engine:
+------------------------------------+
| correlation_coefficient_population |
+------------------------------------+
| 0.964837537107134 |
+------------------------------------+
1 row in set (8 min 11.21 sec)
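The "> 1000 times faster" figure can be checked directly from the two elapsed times shown above. This is a quick arithmetic sketch and nothing more; it says nothing about other queries or data sizes.

```python
columnstore_sec = 0.43            # 1 row in set (0.43 sec)
innodb_sec = 8 * 60 + 11.21       # 1 row in set (8 min 11.21 sec)

speedup = innodb_sec / columnstore_sec
print(f"ColumnStore was about {speedup:.0f}x faster on this query")
```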

Moving Average w/ Window Functions

Moving Average GBPUSD
SELECT
time, close,
AVG(close) OVER (
ORDER BY time ASC
ROWS BETWEEN
6 PRECEDING AND
6 FOLLOWING ) AS MA13,
COUNT(close) OVER (
ORDER BY time ASC
ROWS BETWEEN
6 PRECEDING AND
6 FOLLOWING ) AS row_count
FROM gbpusd
WHERE time BETWEEN TIMESTAMP('2016-06-23') AND TIMESTAMP('2016-06-25');

Moving Average GBPUSD
AVG(close) OVER (
ORDER BY time ASC
ROWS BETWEEN
6 PRECEDING AND
6 FOLLOWING )
AS MA13
time             close   MA13    row_count  frame
6/23/2016 00:00  1.4797  1.4797  7          preceding 6
6/23/2016 00:01  1.4798  1.4797  8          preceding 5
6/23/2016 00:02  1.4798  1.4796  9          preceding 4
6/23/2016 00:03  1.4797  1.4796  10         preceding 3
6/23/2016 00:04  1.4796  1.4796  11         preceding 2
6/23/2016 00:05  1.4796  1.4796  12         preceding 1
6/23/2016 00:06  1.4796  1.4796  13         current row
6/23/2016 00:07  1.4796  1.4796  13         following 1
6/23/2016 00:08  1.4796  1.4796  13         following 2
6/23/2016 00:09  1.4796  1.4796  13         following 3
6/23/2016 00:10  1.4796  1.4797  13         following 4
6/23/2016 00:11  1.4796  1.4797  13         following 5
6/23/2016 00:12  1.4797  1.4797  13         following 6
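The row counts in the table show how the frame `ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING` behaves at the start of the result set: the first row has no preceding rows, so its frame holds only 7 rows, and the frame grows to the full 13 by the seventh row. A minimal Python sketch of that framing follows. It is an illustration of the window semantics, not ColumnStore's implementation, and it uses the rounded close values from the table, so the averages are approximate.

```python
def centered_moving_average(values, preceding=6, following=6):
    """Return (average, row_count) per row for a ROWS BETWEEN ... frame."""
    results = []
    for i in range(len(values)):
        start = max(0, i - preceding)              # frame is cut off at the first row
        end = min(len(values), i + following + 1)  # and at the last row
        window = values[start:end]
        results.append((sum(window) / len(window), len(window)))
    return results


# Rounded close values for 00:00 to 00:12 as displayed in the table
closes = [1.4797, 1.4798, 1.4798, 1.4797, 1.4796, 1.4796, 1.4796,
          1.4796, 1.4796, 1.4796, 1.4796, 1.4796, 1.4797]

for average, count in centered_moving_average(closes)[:7]:
    print(f"{average:.4f} {count}")
```

In this sample the list stops at 00:12, so the frame shrinks again for the last rows; in the query more rows follow, which is why the table stays at 13.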

summary
• Free Forex time series history data analyzed with :
– Simple analytic queries (aggregate functions) w/ SQLPad
– Moving average using Window Function

MariaDB Partners w/ Global Visual Analytics Leader Tableau
https://mariadb.com/about-us/newsroom/press-releases/fastest-growing-open-source-database-mariadb-partners-global
Fastest Growing Open Source Database MariaDB Partners With Global Visual Analytics Leader Tableau
Combination of ubiquitous database and visual analytics technologies accelerates delivery of business insights
MENLO PARK, Calif. and HELSINKI – December 12, 2017 – MariaDB® Corporation, the company behind the fastest growing open source database, today announced Tableau Software, the global leader in visual analytics, has certified MariaDB integration with Tableau’s business intelligence (BI) and visual analytics platform. Bringing together the highly popular data management products and the renowned visualization technologies means businesses globally can confidently use these preferred solutions for reliable, fast, data-driven business decisions.

Analyzing Queries in ColumnStore

Analyzing Queries : select calGetStats();
https://mariadb.com/kb/en/library/analyzing-queries-in-columnstore/
MariaDB [forex]> select calGetStats();
Query Stats: MaxMemPct-1; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-0;
CacheI/O-6298; BlocksTouched-6298; PartitionBlocksEliminated-0; MsgBytesIn-4MB;
MsgBytesOut-11MB; Mode-Distributed

Analyzing Queries : select calGetStats();
• MaxMemPct - Peak memory utilization on the User Module, likely in support of a large (User Module)
based hash join operation.
• NumTempFiles - Report on any temporary files created in support of query operations larger than
available memory, typically for unusual join operations where the smaller table join cardinality exceeds
some configurable threshold.
• TempFileSpace - Report on space used by temporary files created in support of query operations larger
than available memory, typically for unusual join operations where the smaller table join cardinality
exceeds some configurable threshold.
• PhyI/O - Number of 8k blocks read from disk, SSD, or other persistent storage.
• CacheI/O - Approximate number of 8k blocks processed in memory, adjusted down by the number of
discrete PhyI/O calls required.
• BlocksTouched - Approximate number of 8k blocks processed in memory.
• PartitionBlocksEliminated - The number of block touches eliminated via the Extent Map elimination
behavior.
• MsgBytesIn, MsgBytesOut - Message size in MB sent between nodes in support of the query.
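When comparing several queries it can help to turn the `calGetStats()` string into named fields. The sketch below parses the sample output shown earlier. It assumes the format is always `Name-Value` pairs separated by `;` after the `Query Stats:` prefix, which is inferred from that single sample, and the function name `parse_query_stats` is hypothetical.

```python
def parse_query_stats(text):
    """Split a calGetStats() result string into a {name: value} dict."""
    _, _, body = text.partition(":")  # drop the "Query Stats" prefix
    stats = {}
    for item in body.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first hyphen; the names in the sample contain none
        name, _, value = item.partition("-")
        stats[name] = value
    return stats


sample = ("Query Stats: MaxMemPct-1; NumTempFiles-0; TempFileSpace-0B; "
          "ApproxPhyI/O-0; CacheI/O-6298; BlocksTouched-6298; "
          "PartitionBlocksEliminated-0; MsgBytesIn-4MB; MsgBytesOut-11MB; "
          "Mode-Distributed")

stats = parse_query_stats(sample)
print(stats["BlocksTouched"], stats["PartitionBlocksEliminated"], stats["Mode"])
```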

Analyzing Queries : calSetTrace(1); calGetTrace();
MariaDB [test]> select calSetTrace(1);
MariaDB [test]> select c_name, sum(o_totalprice) from customer, orders where o_custkey =
c_custkey and c_custkey = 5 group by c_name;
+--------------------+-------------------+
| c_name | sum(o_totalprice) |
+--------------------+-------------------+
| Customer#000000005 | 684965.28 |
+--------------------+-------------------+
1 row in set, 1 warning (0.34 sec)
MariaDB [test]> select calGetTrace();
Desc Mode Table           TableOID ReferencedColumns        PIO LIO PBE Elapsed Rows
BPS  PM   customer        3024     (c_custkey,c_name)       0   43  36  0.006   1
BPS  PM   orders          3038     (o_custkey,o_totalprice) 0   766 0   0.032   3
HJS  PM   orders-customer 3038     -                        -   -   -   -----   -
TAS  UM   -               -        -                        -   -   -   0.021   1

Desc – Operation being executed. Possible values:
● BPS - Batch Primitive Step : scanning or projecting the column blocks.
● CES - Cross Engine Step: Performing Cross engine join
● DSS - Dictionary Structure Step : a dictionary scan for a particular variable length string value.
● HJS - Hash Join Step : Performing a hash join between 2 tables
● HVS - Having Step: Performing the having clause on the result set
● SQS - Sub Query Step: Performing a sub query
● TAS - Tuple Aggregation step : the process of receiving intermediate aggregation results at the UM from the PM
nodes.
● TNS - Tuple Annexation Step : Query result finishing, e.g. filling in constant columns, limit, order by and final
distinct cases.
● TUS - Tuple Union Step : Performing a SQL union of 2 sub queries.
● TCS - Tuple Constant Step: Process Constant Value Columns
● WFS - Window Function Step: Performing a window function.

• Mode – Where the operation was performed: UM or PM
• Table – Table for which columns may be scanned/projected.
• TableOID – ObjectID for the table being scanned.
• ReferencedOIDs – ObjectIDs for the columns required by the query.
• PIO – Physical I/O (reads from storage) executed for the query.
• LIO – Logical I/O executed for the query, also known as Blocks Touched.
• PBE – Partition Blocks Eliminated identifies blocks eliminated by Extent Map min/max.
• Elapsed – Elapsed time for a given step.
• Rows – Intermediate rows returned

MariaDB ColumnStore Architecture
Columnar Distributed Data Storage
Diagram layers, top to bottom:
• Application: BI Tool, SQL Client, Custom Big Data App
• MariaDB SQL Front End
• Distributed Query Engine
• Data Storage: Local Storage | SAN | EBS | Gluster FS
Modules shown: User Module (UM), Performance Module (PM)

Row-oriented vs. Column-oriented format
•Row oriented:
–Rows stored sequentially in a file
–Scans through every record row by row
•Column oriented:
–Each column is stored in a separate file
–Scans only the relevant columns
ID Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
ID: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F
SELECT Fname FROM Table 1 WHERE State = 'NY'
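The query above needs only two of the eight columns. A toy Python sketch of the two layouts shows why that matters: the row-oriented scan walks whole records, while the column-oriented scan reads just the `State` and `Fname` lists. This is an in-memory illustration using the sample table from the slide, not a model of ColumnStore's file format, and the function names are hypothetical.

```python
NAMES = ["ID", "Fname", "Lname", "State", "Zip", "Phone", "Age", "Sex"]

# Row-oriented: one tuple per record
rows = [
    (1, "Bugs", "Bunny", "NY", "11217", "(718) 938-3235", 34, "M"),
    (2, "Yosemite", "Sam", "CA", "95389", "(209) 375-6572", 52, "M"),
    (3, "Daffy", "Duck", "NY", "10013", "(212) 227-1810", 35, "M"),
    (4, "Elmer", "Fudd", "ME", "04578", "(207) 882-7323", 43, "M"),
    (5, "Witch", "Hazel", "MA", "01970", "(978) 744-0991", 57, "F"),
]

# Column-oriented: one list per column
columns = {name: [row[i] for row in rows] for i, name in enumerate(NAMES)}


def select_fname_row_oriented(rows, state):
    """Walk every full record and pick out the two fields of interest."""
    return [row[1] for row in rows if row[3] == state]


def select_fname_column_oriented(columns, state):
    """Touch only the State and Fname columns."""
    return [fname for fname, st in zip(columns["Fname"], columns["State"])
            if st == state]


print(select_fname_row_oriented(rows, "NY"))
print(select_fname_column_oriented(columns, "NY"))
# Values stored in the records walked vs. values in the two columns read
print(len(rows) * len(NAMES), 2 * len(rows))
```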

High Performance Query Processing
Horizontal Partition: 8 Million Rows per extent (Extent 1, Extent 2, Extent 3)
Storage Architecture reduces I/O
• Only touch column files
that are in projection, filter
and join conditions
• Eliminate disk block touches
to partitions outside filter
and join conditions
Extent 1:
Min State: CA, Max State: NY
Extent 2:
Min State: OR, Max State: WY
Extent 3:
Min State: IA, Max State: TN
SELECT Fname FROM Table 1 WHERE State = 'NY'
Vertical partitions (one per column): ID, Fname, Lname, State, Zip, Phone, Age, Sex
Extent 1 (IDs 1 to 8M): State values NY, CA, NY, ME, ..., MN
Extent 2 (IDs 8M+1 to 16M): State values WY, TX, OR, ..., VA (ELIMINATED PARTITION)
Extent 3 (IDs 16M+1 to 24M): State values TN, IA, NY, ..., PA
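The elimination step can be sketched with the per-extent minimum and maximum values from the slide. For `State = 'NY'`, an extent needs to be read only if NY falls between its min and max. The Python below is a simplified illustration of that check, assuming plain string comparison; the function name `extents_to_scan` is hypothetical.

```python
# Min/max State values per extent, as listed on the slide
extents = [
    {"name": "Extent 1", "min_state": "CA", "max_state": "NY"},
    {"name": "Extent 2", "min_state": "OR", "max_state": "WY"},
    {"name": "Extent 3", "min_state": "IA", "max_state": "TN"},
]


def extents_to_scan(extents, state):
    """Keep only extents whose [min, max] range can contain the filter value."""
    return [e["name"] for e in extents
            if e["min_state"] <= state <= e["max_state"]]


# 'NY' sorts before 'OR', so Extent 2 cannot contain it and is skipped
print(extents_to_scan(extents, "NY"))
```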

Sizing
Minimum Spec
• UM: 4 core, 32 G RAM
• PM: 4 core, 16 G RAM
Typical Server spec
• UM: 8 core, 264G RAM
• PM: 8 core, 64G RAM
Data Storage
External Data Volumes
• Maximum 2 data volumes per IO channel per PM node server
• up to 2TB on the disk per data volume ≈ Max 4 TB per PM node
Local disk
• Up to 2TB on the disk per PM node server
DETAILED SIZING GUIDE based on data size and workload

MariaDB ColumnStore Sizing - Example
• 60TB uncompressed data = 6TB compressed data at 10x compression
• 2 UMs - 8 core, 512GB (based on workload)
• 6TB compressed = 3 data volumes (at 2TB per volume)
-with 1 data volume per PM node - 3 PMs
• Data growth - 2TB per month, Data retention - 2 years
-Plan for 2TB x 24 = 48TB additional
-48TB = 4.8TB compressed ≈ 3 data volumes (at 2TB per volume)
-with 1 data volume per PM node - 3 additional PMs
• Total 6 PMs, 2 UMs
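The sizing example is a short chain of divisions and round-ups, sketched below in Python. The defaults mirror the slide (10x compression, 2TB per data volume, 1 data volume per PM node). The function name `pm_nodes_needed` is hypothetical, and real sizing would also depend on workload, as the slide notes.

```python
import math


def pm_nodes_needed(uncompressed_tb, compression_ratio=10,
                    volume_tb=2, volumes_per_pm=1):
    """PM nodes required to hold the given amount of uncompressed data."""
    compressed_tb = uncompressed_tb / compression_ratio
    volumes = math.ceil(compressed_tb / volume_tb)   # round up to whole volumes
    return math.ceil(volumes / volumes_per_pm)


initial_pms = pm_nodes_needed(60)        # 60TB -> 6TB compressed -> 3 volumes
growth_pms = pm_nodes_needed(2 * 24)     # 2TB/month for 24 months = 48TB -> 4.8TB
print(initial_pms, growth_pms, initial_pms + growth_pms)
```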

using ColumnStore
via SSL/TLS connection

/usr/local/mariadb/columnstore/mysql/my.cnf
[client]
ssl-ca = /etc/pki/tls/mariadb/certs/ca-cert.pem
ssl-cert = /etc/pki/tls/mariadb/certs/client-cert.pem
ssl-key = /etc/pki/tls/mariadb/private/client-key.pem
[mysqld]
ssl-ca = /etc/pki/tls/mariadb/certs/ca-cert.pem
ssl-cert = /etc/pki/tls/mariadb/certs/server-cert.pem
ssl-key = /etc/pki/tls/mariadb/private/server-key.pem

status : SSL enabled
MariaDB [(none)]> status
--------------
/usr/local/mariadb/columnstore/mysql/bin/mysql Ver 15.1 Distrib 10.1.23-MariaDB, for Linux
(x86_64) using readline 5.1
Connection id: 5
Current database:
Current user: root@localhost
SSL: Cipher in use is DHE-RSA-AES256-GCM-SHA384
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server: MariaDB
Server version: 10.1.23-MariaDB Columnstore 1.0.9-1
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8
Conn. characterset: utf8
UNIX socket: /usr/local/mariadb/columnstore/mysql/lib/mysql/mysql.sock
