Big Data Analytics with MariaDB ColumnStore

MariaDB Company Confidential
Why Analytics?
• Get the most value from your data assets
• Faster, better decision making
• Cost reduction
• New products and services

Types of Analytics
Descriptive: What happened?
Diagnostic: Why did it happen?
Predictive: What is likely to happen?
Prescriptive: What should I do about it?

Descriptive: What happened?
● Reports
○ Sales Report
○ Expense summary
● Ad-hoc requests to analyst

Diagnostic: Why did it happen?
● Aggregates: aggregate a measure over one or more dimensions
○ Find total sales
○ Top five products ranked by sales
● Roll-ups: aggregate at different levels of a dimension hierarchy
○ Given total sales by city, roll up to get sales by state
● Drill-down: inverse of roll-ups
○ Given total sales by state, drill down to get totals by city
● Slicing and dicing:
○ Equality and range selections on one or more dimensions
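
The aggregate, roll-up, drill-down and slice operations listed above can be sketched in a few lines of plain Python. This is only an illustration of the concepts, not ColumnStore code: the `SALES` rows, the city and product names, and the helper `aggregate` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (state, city, product, sales). Values are invented.
SALES = [
    ("CA", "San Jose", "widget", 120),
    ("CA", "San Jose", "gadget", 80),
    ("CA", "Fresno", "widget", 50),
    ("NY", "Albany", "widget", 70),
    ("NY", "Buffalo", "gadget", 30),
]


def aggregate(rows, key):
    """Sum the sales measure grouped by the dimension(s) returned by key(row)."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)


# Drill-down level: total sales by (state, city).
by_city = aggregate(SALES, key=lambda r: (r[0], r[1]))

# Roll-up: re-aggregate the city totals one level up the hierarchy, to state.
rolled_up = defaultdict(int)
for (state, _city), total in by_city.items():
    rolled_up[state] += total
by_state = dict(rolled_up)

# Slice: an equality selection on one dimension before aggregating.
widget_by_state = aggregate([r for r in SALES if r[2] == "widget"], key=lambda r: r[0])

# Aggregate plus ranking: products ordered by total sales, highest first.
top_products = sorted(
    aggregate(SALES, key=lambda r: r[2]).items(),
    key=lambda kv: kv[1],
    reverse=True,
)
```

In SQL the same steps are GROUP BY queries at different levels of the hierarchy, with a WHERE clause providing the slice.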

Predictive: What is likely to happen?
● Sales prediction
○ Analyze data to identify trends, spot weaknesses or determine conditions among broader data sets for making decisions about the future
● Targeted marketing
○ What is the likelihood of a customer buying a particular product based on past buying behavior?

Big Data Analytics Use Cases
By industry
Finance
• Identify trade patterns
• Detect fraud and anomalies
• Predict trading outcomes
Manufacturing
• Simulations to improve design/yield
• Detect production anomalies
• Predict machine failures (sensor data)
Telecom
• Behavioral analysis of customer calls
• Network analysis (performance and reliability)
Healthcare
• Find genetic profiles/matches
• Analyze health vs spending
• Predict viral outbreaks

What do you need for Big Data Analytics
• Real-time analytics
– High speed data ingestion
– High speed read queries
• Analytics
– Built in analytics
– Choice of BI tools
• Cost of deployment and use
– Hardware and Price/Performance ratio
– Large talent pool

Existing Approaches
Limited real-time analytics
Slow releases of product innovation
Expensive hardware and software
Data Warehouses
Hadoop / NoSQL
Limited SQL support
Difficult to install/manage
Limited talent pool
Data lake with no data management
Hard to use

MariaDB Big Data Solution
MariaDB AX and MariaDB ColumnStore

MariaDB AX
Analytics: simple, fast, scalable… and open source

MariaDB AX
Components: MariaDB Server, MariaDB MaxScale, MariaDB ColumnStore
• Parallel queries
• Distributed storage
• No indexes
• Automatic partitioning
• Read optimized
• High compression
• Low disk IO
[Diagram: MariaDB MaxScale in front of MariaDB Server / ColumnStore nodes, each backed by ColumnStore storage]

MariaDB ColumnStore
• GPLv2 Open Source
• Columnar, Massively Parallel MariaDB Storage Engine
• Scalable, high-performance analytics platform
• Built in redundancy and high availability
• Runs on premise, on AWS cloud
• Full SQL syntax and capabilities regardless of platform
[Diagram: Big Data Sources, Analytics, Insight. ELT tools and BI tools connect to MariaDB ColumnStore running on Node 1, Node 2, Node 3 ... Node N over Local / SAN / Cloud / GlusterFS® storage]
Latest GA Version: 1.1.2

MariaDB ColumnStore
High performance columnar storage engine that supports a wide variety of analytical use cases with SQL in highly scalable distributed environments
• Faster, more efficient queries: parallel query processing for distributed environments
• Easier enterprise analytics: single SQL interface for OLTP and analytics
• Better price performance: power of SQL and freedom of open source for big data analytics

Why Columnar?
• Row oriented:
– Rows stored sequentially in a file
– Scans through every record row by row
• Column oriented:
– Each column is stored in a separate file
– Scans only the relevant columns
Row-oriented layout (one record per row):
ID  Fname     Lname  State  Zip    Phone           Age  Sex
1   Bugs      Bunny  NY     11217  (718) 938-3235  34   M
2   Yosemite  Sam    CA     95389  (209) 375-6572  52   M
3   Daffy     Duck   NY     10013  (212) 227-1810  35   M
4   Elmer     Fudd   ME     04578  (207) 882-7323  43   M
5   Witch     Hazel  MA     01970  (978) 744-0991  57   F
Column-oriented layout (one file per column):
ID:    1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip:   11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age:   34, 52, 35, 43, 57
Sex:   M, M, M, M, F
SELECT Fname FROM Table 1 WHERE State = 'NY'
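
To make the difference concrete, the sketch below answers the query above against both layouts and counts how many stored values each scan has to read. It is a toy model in plain Python, not how ColumnStore is implemented: the `touched` counter is a simplification that counts values rather than disk blocks, and the function names are invented for the example.

```python
# The five sample records from the slide, stored row by row.
COLUMN_NAMES = ("ID", "Fname", "Lname", "State", "Zip", "Phone", "Age", "Sex")
ROWS = [
    (1, "Bugs", "Bunny", "NY", "11217", "(718) 938-3235", 34, "M"),
    (2, "Yosemite", "Sam", "CA", "95389", "(209) 375-6572", 52, "M"),
    (3, "Daffy", "Duck", "NY", "10013", "(212) 227-1810", 35, "M"),
    (4, "Elmer", "Fudd", "ME", "04578", "(207) 882-7323", 43, "M"),
    (5, "Witch", "Hazel", "MA", "01970", "(978) 744-0991", 57, "F"),
]

# Column-oriented layout: one list (standing in for one file) per column.
COLUMNS = {name: [row[i] for row in ROWS] for i, name in enumerate(COLUMN_NAMES)}


def row_scan(rows):
    """Row store: every field of every record is read to evaluate the query."""
    touched = 0
    result = []
    for row in rows:
        touched += len(row)          # the whole record comes off disk
        if row[3] == "NY":           # State is the fourth field
            result.append(row[1])    # Fname is the second field
    return result, touched


def column_scan(columns):
    """Column store: only the State and Fname columns are read."""
    state = columns["State"]
    fname = columns["Fname"]
    touched = len(state)             # filter column is scanned in full
    matches = [i for i, value in enumerate(state) if value == "NY"]
    touched += len(matches)          # projected column is read only at matches
    return [fname[i] for i in matches], touched
```

On the five sample rows, `row_scan(ROWS)` reads 40 values and `column_scan(COLUMNS)` reads 7, and both return the same two first names.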

Workload – Query Vision/Scope
[Chart: a query-scope axis marked 1, 100, 10,000, 1,000,000, 100,000,000 and 10,000,000,000, with data-size bands of 10-100GB, 100-1,000GB and 1-10TB. OLTP/NoSQL workloads (InnoDB, MyRocks, MyISAM) sit at the low end; OLAP/analytic/reporting workloads (ColumnStore) sit at the high end.]
ColumnStore is suited for reporting or analysis of millions to billions of rows from data sets containing millions to trillions of rows.

Technical Use Cases
Data Warehousing
• Selective column based queries
• Large number of dimensions
High Performance Analytics On Large Volume Of Data
• Reporting and analysis on millions or billions of rows
• From datasets containing millions to trillions of rows
• Terabytes to petabytes of datasets
Analytics Require Complex Joins, Windowing Functions

Financial Services
Trade Analytics
• Analyze 20-30 million quotes per day
• Identify trade patterns and predict the outcome
Fraud Detection
• Detect fraudulent or anomalous trades among millions of transactions per day
• Proactively identify risks and prevent billions in losses due to fraud
Regulatory Compliance
• Archive historic transactional data
• FINRA, Dodd-Frank Act, SEC, SOX

Health Care / Life Science
Genome analysis
• In-depth genome research for the dairy industry to improve production of milk and protein
• Fast data load for large genome datasets (DNA data for 7 billion cows in US - 20GB per load)
• SQL based analytics
Health care spending analysis
• Data size: 3TB
• Analyze US health care spending for 155 conditions with 7 years of historical data
• Used Sankey diagram, treemap, and pyramid chart to analyze trends by age, sex, type of care, and condition
Viral disease analysis
• Used geospatial techniques with an interactive map to identify the spread of Ebola
• The map displays not only the existing transmission of Ebola virus, but also the probability of occurrence

Telecom
Customer behavior analysis
• Analyze call data records to segment customers based on their behavior
• Data-driven analysis for customer satisfaction
• Create behavior-based up-sell or cross-sell opportunities
Call data analysis
• Data size: 6TB
• Ingest 1.5 million rows of logs per day with 30 million texts and 3 million calls
• Call and network quality analysis
• Provide higher quality customer service based on data

MariaDB ColumnStore Architecture
[Diagram: user connections reach User Module 1 ... User Module n, which sit above Performance Module 1, Performance Module 2 ... Performance Module n, which sit above the columnar distributed data storage]
• User Module: MariaDB front end and query engine; processes SQL requests
• Performance Module: distributed processing engine
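
The division of labor between the two module types can be pictured as a scatter-gather: the User Module fans a request out, each Performance Module works on its own slice of the data, and the User Module combines the partial results. The sketch below models that idea with threads in plain Python. It is an illustration under stated assumptions, not ColumnStore's actual protocol: the function names and the three `PARTITIONS` are hypothetical, and the values are the Age column from the earlier sample table.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical split of the sample Age column across three Performance Modules.
PARTITIONS = [[34, 52], [35, 43], [57]]


def pm_partial(partition):
    """Performance Module role: aggregate only the locally stored slice."""
    return sum(partition), len(partition)


def um_average(partitions):
    """User Module role: fan the work out, then combine the partial results.

    Assumes at least one partition and at least one row overall.
    """
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(pm_partial, partitions))
    total = sum(part_sum for part_sum, _ in partials)
    count = sum(part_count for _, part_count in partials)
    return total / count


average_age = um_average(PARTITIONS)
```

Note that the average is computed from per-partition sums and counts rather than per-partition averages, which is what makes the combined result exact.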

Storage Architecture
[Diagram: logical view of a table with Column 1, Column 2, Column 3 ... Column N; each column is split into Extent 1, Extent 2 ... Extent M of 8 million rows each (8MB to 64MB per extent)]
Data automatically arranged by
• Column – Acts as vertical partitioning
• Extents – Act as horizontal partitioning
• Columnar storage
– Each column stored as separate file
– No index management for query performance tuning
– Online schema changes: add new column without impacting running queries
• Automatic horizontal partitioning
– Logical partition every 8 million rows
– In-memory metadata of partition min and max
– No partition management for query performance tuning
• Compression
– Default ON
– Accelerate decompression rate
– Reduce I/O for compressed blocks
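
The extent figures on this slide can be checked with a little arithmetic. The sketch below assumes that "8 million rows" means 8 × 2^20 rows and that column values are stored at a fixed width of 1 to 8 bytes before compression; neither assumption is stated on the slide, but together they reproduce its 8MB to 64MB range. The function names are made up for the example.

```python
# Assumption: "8 million rows" per extent is read as 8 * 2**20 rows.
ROWS_PER_EXTENT = 8 * 1024 * 1024


def extent_size_mib(column_width_bytes):
    """Uncompressed size of one full extent for a fixed-width column, in MiB."""
    return ROWS_PER_EXTENT * column_width_bytes / (1024 * 1024)


def extent_index(row_number):
    """Zero-based extent that holds a zero-based row number."""
    return row_number // ROWS_PER_EXTENT


def extents_needed(row_count):
    """Number of extents per column needed to hold row_count rows."""
    return -(-row_count // ROWS_PER_EXTENT)  # ceiling division


smallest = extent_size_mib(1)   # 1-byte column
largest = extent_size_mib(8)    # 8-byte column
```

Under these assumptions a 24 × 2^20 row table needs three extents per column, matching the three-extent picture used on the following slide.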

High Performance Query Processing
Storage Architecture reduces I/O
• Only touch column files that are in projection, filter and join conditions
• Eliminate disk block touches to partitions outside filter and join conditions
Extent 1 (horizontal partition, 8 million rows): Min State: CA, Max State: NY
Extent 2 (horizontal partition, 8 million rows): Min State: OR, Max State: WY
Extent 3 (horizontal partition, 8 million rows): Min State: IA, Max State: TN
SELECT Fname FROM Table 1 WHERE State = 'NY'
[Diagram: the sample table with columns ID, Fname, Lname, State, Zip, Phone, Age and Sex, each column a vertical partition, split into three extents covering IDs 1 to 8M, 8M+1 to 16M and 16M+1 to 24M. Extent 2, whose State range excludes 'NY', is marked ELIMINATED PARTITION.]
SQL Features
• Cross Engine Joins
• UDF
• DML
• Aggregation
• DDL
• Disk Based Joins
• Windowing Functions
• SELECT Query

Windowing Functions
• Aggregate over a series of related rows
• Simplified functions for complex statistical analytics over a sliding window per row
– Cumulative, moving or centered aggregates
– Simple statistical functions like rank, max, min, average, median
– More complex functions such as distribution, percentile, lag, lead
– Without running complex sub-queries
Supported functions:
MAX, MIN, COUNT, SUM, AVG, VARIANCE, VAR_POP, VAR_SAMP, STD, STDDEV, STDDEV_POP, STDDEV_SAMP, ROW_NUMBER
RANK, DENSE_RANK, PERCENT_RANK, NTH_VALUE, FIRST_VALUE, LAST_VALUE, CUME_DIST, LAG, LEAD, NTILE, PERCENTILE_CONT, PERCENTILE_DISC, MEDIAN
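
To show what "an aggregate over a sliding window per row" means, the sketch below computes three window-style results in plain Python: a cumulative sum, a moving average and a lag. It illustrates the semantics only and is not ColumnStore code. The `DAILY` values are made up, and the SQL fragments in the docstrings are the standard window-function forms these helpers correspond to.

```python
# Hypothetical daily totals, already in the desired ORDER BY order.
DAILY = [10, 20, 30, 40]


def cumulative_sum(values):
    """Running total, like SUM(x) OVER (ORDER BY d ROWS UNBOUNDED PRECEDING)."""
    out = []
    total = 0
    for value in values:
        total += value
        out.append(total)
    return out


def moving_average(values, preceding):
    """Like AVG(x) OVER (ORDER BY d ROWS BETWEEN <preceding> PRECEDING AND CURRENT ROW)."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - preceding): i + 1]
        out.append(sum(window) / len(window))
    return out


def lag(values, offset=1, default=None):
    """Like LAG(x, offset) OVER (ORDER BY d): the value `offset` rows earlier."""
    return [values[i - offset] if i >= offset else default for i in range(len(values))]


running = cumulative_sum(DAILY)
smoothed = moving_average(DAILY, preceding=1)
previous = lag(DAILY)
```

Each output list has one entry per input row, which is the key difference from a GROUP BY aggregate that collapses rows.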

Window Function Example: Top N Visitors for Each Month
• Total for each visitor by month
• Top 1: Time_rank = 1
• Top 2: Time_rank <= 2
• Top N: Time_rank <= N
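
The top-N pattern on this slide has three steps: total per visitor per month, rank within each month, then filter on the rank. The sketch below walks through those steps in plain Python. The `VISITS` rows, the visitor names and the idea that the total is minutes of visit time are assumptions made for illustration; only the `Time_rank` filter comes from the slide. The ranking follows standard RANK() semantics, where ties share a rank.

```python
from collections import defaultdict

# Hypothetical (month, visitor, minutes) rows.
VISITS = [
    ("2017-01", "alice", 30),
    ("2017-01", "bob", 50),
    ("2017-01", "alice", 40),
    ("2017-01", "carol", 20),
    ("2017-02", "bob", 10),
    ("2017-02", "carol", 60),
]


def top_n_visitors(visits, n):
    """Return (month, visitor, total, time_rank) rows with time_rank <= n."""
    # Step 1: total for each visitor by month (GROUP BY month, visitor).
    totals = defaultdict(int)
    for month, visitor, minutes in visits:
        totals[(month, visitor)] += minutes

    # Step 2: rank within each month, like
    # RANK() OVER (PARTITION BY month ORDER BY total DESC).
    by_month = defaultdict(list)
    for (month, visitor), total in totals.items():
        by_month[month].append((visitor, total))

    result = []
    for month in sorted(by_month):
        ordered = sorted(by_month[month], key=lambda vt: vt[1], reverse=True)
        rank = 0
        previous_total = None
        for position, (visitor, total) in enumerate(ordered, start=1):
            if total != previous_total:
                rank = position          # ties keep the earlier rank
                previous_total = total
            # Step 3: keep only rows with time_rank <= n.
            if rank <= n:
                result.append((month, visitor, total, rank))
    return result


top_1 = top_n_visitors(VISITS, 1)
```

Changing the final argument from 1 to 2 or N gives the other variants listed on the slide without touching the aggregation step.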

High Performance Data Ingestion
• Fully parallel high speed data load
– Parallel data loads on all PMs simultaneously
– Multiple tables can be loaded simultaneously
– Read queries continue without being blocked
• Micro-batch loading for real-time data flow
[Diagram: Column 1, Column 2 ... Column N, each split into Extent 1, Extent 2 ... Extent M of 8 million rows (8MB to 64MB per extent), forming horizontal partitions. A high water mark separates the data accessed by running queries from the new data being loaded.]
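
The high water mark idea can be sketched as an append-only column where a query reads only up to the mark it saw when it started, while a load appends beyond the mark and publishes the new rows by advancing it. The class below is a single-threaded toy in plain Python meant to show that visibility rule. All names are hypothetical, and it makes no claim about how ColumnStore implements the mechanism.

```python
class ColumnSegment:
    """Toy append-only column with a high water mark (HWM)."""

    def __init__(self):
        self._values = []
        self._hwm = 0  # rows at positions below the HWM are visible to queries

    def append(self, batch):
        """Loader: write new rows beyond the HWM; queries cannot see them yet."""
        self._values.extend(batch)

    def commit(self):
        """Loader: publish the loaded rows by advancing the HWM."""
        self._hwm = len(self._values)

    def begin_query(self):
        """Reader: remember the HWM in effect when the query starts."""
        return self._hwm

    def scan(self, hwm):
        """Reader: read only the rows that were visible at query start."""
        return self._values[:hwm]


segment = ColumnSegment()
segment.append(["NY", "CA"])
segment.commit()

query_hwm = segment.begin_query()     # a query starts here
segment.append(["TX"])                # a load runs while the query is active
during_load = segment.scan(query_hwm)

segment.commit()                      # the load finishes
after_load = segment.scan(segment.begin_query())
```

Because the loader never rewrites rows below the mark, the running query's view stays stable without blocking the load.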

Enterprise Grade
• Enterprise grade security
– SSL, role based access, auditability
• Flexibility of platform
– Run on premise using commodity Linux servers
– Run on AWS
• High availability
– Automatic UM failover
– Automatic PM failover with distributed data attachment across all PMs in SAN and EBS environments
[Diagram: User Module, Performance Module, Columnar Distributed Data Storage]

ColumnStore 1.1 Features
Data Engine: Columnar storage engine based on MariaDB Server 10.2
Streaming / API: Bulk import API to support programmatic and streaming writes
High Availability: Integrated GlusterFS support to provide storage HA for local disk
Analytics: User Defined Aggregate / Window Functions
Data Types: Text and Blob support
Ease of Use: Backup and Restore Tool
Performance: Improved query and memory handling (5% faster than 1.0)
Security: Audit Plugin integration
Certifications: Tableau certification

Data Streaming: ColumnStore Data API
What:
• C++ API to directly write to PM nodes
• Per table write
• Input data is C++ data structure in API calls
• Can run remotely from UM and PM servers
• Bindings for Python, Go, and Java in progress (and other
languages as long as supported by SWIG).
Benefits:
● Real-time streaming directly into distributed data store
● No need to move large CSV data files to UM/PM
● Enable non-CSV data sources for columnstore
● Run outside UM/PM. Build custom ETL applications
https://mariadb.com/kb/en/library/columnstore-bulk-write-sdk/
[Diagram: data sources such as syslog feed a data streaming application, which uses the CS Data API library to write to the write engine on each PM node]
ColumnStore Data Adapters 1.1
What ?
• Pre-packaged data adapters written using CS data API
• Convert from a specific data source into MariaDB
ColumnStore
Benefits
● Out of box real time data streaming into CS
● No need to move large CSV data files to UM/PM
● Enable non-CSV data sources for columnstore
● Run outside UM/PM. Build custom ETL applications
[Diagram: two adapters write to the write engine on each PM node through the CS Data API library. The MaxScale CDC Adapter reads from the MDB master through MaxScale and the MaxScale CDC API; the Avro Adapter reads through a Kafka consumer interface.]
GlusterFS Volume Replication
Data Redundancy
• GlusterFS can replicate files within a volume: HA without the need for an expensive SAN
• ColumnStore storage nodes can read other files within a volume: simple, automatic failover
[Diagram: MariaDB Server / ColumnStore nodes over three ColumnStore storage nodes (dbroot1, dbroot2, dbroot3). Replication places the pairs /dbroot1 /dbroot2, /dbroot2 /dbroot3 and /dbroot3 /dbroot1 on the storage nodes.]
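
The failover idea behind this replica layout can be sketched as follows: every dbroot exists on two nodes, so when one node is lost, a surviving node that holds a copy can serve it. The code below is a toy model in plain Python. The node names `pm1` to `pm3`, the mapping of dbroot pairs to nodes and the least-loaded selection rule are assumptions made for illustration and are not taken from ColumnStore or GlusterFS.

```python
# Assumed placement, following the dbroot pairs shown in the diagram.
REPLICAS = {
    "pm1": ["dbroot1", "dbroot2"],
    "pm2": ["dbroot2", "dbroot3"],
    "pm3": ["dbroot3", "dbroot1"],
}


def assign_dbroots(replicas, alive):
    """Map each dbroot to one live node that holds a copy of it.

    Picks the least-loaded candidate so work stays roughly balanced.
    Raises RuntimeError if some dbroot has no live copy.
    """
    load = {node: 0 for node in alive}
    dbroots = sorted({dbroot for held in replicas.values() for dbroot in held})
    assignment = {}
    for dbroot in dbroots:
        candidates = [node for node in sorted(alive) if dbroot in replicas[node]]
        if not candidates:
            raise RuntimeError(f"no live replica for {dbroot}")
        chosen = min(candidates, key=lambda node: load[node])
        assignment[dbroot] = chosen
        load[chosen] += 1
    return assignment


healthy = assign_dbroots(REPLICAS, {"pm1", "pm2", "pm3"})
after_pm2_failure = assign_dbroots(REPLICAS, {"pm1", "pm3"})
```

With two copies of each dbroot, any single node can fail and every dbroot remains reachable; losing two of the three nodes leaves one dbroot without a live copy.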

MariaDB AX
● MariaDB ColumnStore releases
● MariaDB database proxy, MaxScale
● MariaDB Connectors
● 24x7x365 support
● 30-minute emergency response time
● Mission-critical patching
● Guaranteed version support
● Management and monitoring tools
● Installers
Modern data warehousing solution for large scale analytics

Getting Started
• https://mariadb.com/kb/en/mariadb-columnstore/
• https://mariadb.com/downloads/mariadb-ax

MariaDB ColumnStore 1.0
Data Engine
● Columnar Engine based on MariaDB 10.1
Scale
● Columnar, Massively Parallel
● Linear scalability with automatic data partitioning
● Data compression designed to accelerate decompression rate, reducing disk I/O
Performance
● High performance analytics
● Columnar optimized, massively parallel, distributed query processing on commodity servers
Data Ingestion
● High speed parallel data load and extract without blocking reads
Analytics
● In database analytics with complex joins, windowing functions
● ACID Compliant
● Extensible User Defined Functions (UDF) for custom analytics
● Out of box BI Tools connectivity, Analytics integration with R
Enterprise Grade
● Cross join tables between MariaDB and ColumnStore for full insight
● SSL support, Auditability, Role Based Access
● Built-in High availability for UM and PM
Ease of Use
● Automatic horizontal partitioning
● No index, views or manual partition tuning needed
● Online schema changes while read queries continue
● Deploy anywhere on premise or cloud
