MariaDB ColumnStore is a column-oriented storage engine for MariaDB that provides massively parallel processing for analytics workloads involving large datasets. It stores each column of data as a separate file for improved performance on analytics queries. ColumnStore uses a distributed architecture that allows queries to be processed in parallel across nodes, and scales linearly as new nodes are added. It provides faster analytics than row-oriented databases through its columnar format and compression.
3. OLTP/NoSQL
Workloads
• OLAP/Analytic/
Reporting Workloads
Suited for reporting or analysis of millions-billions of rows from data sets containing millions-trillions of rows.
Workload – Query Vision/Scope
1 100 10,000
10-100GB
10,000,000,000
1-10TB
1,000,000 100,000,000
100-1,000GB
4. 2. Diagnostic Analytics
Why did it Happen?
1. Descriptive Analytics
What is Happening?
Traditional OLAP
3. Predictive Analytics
What is likely to happen?
4. Prescriptive Analytics
What should I do about it?
Big Data Analytics
Social'Media'
Sensors'
Node'1''
Biometrics'
Mobile'
Data Collection
MariaDB
ColumnStore
Data Processing
BI Tools, Data Science
Applications
Connectors,
SPARK Integration etc
MariaDB
MaxScale
Transac5onal,'
Opera5onal'
Analytics
Insight
6. MariaDB ColumnStore
• GPLv2 Open Source
• Columnar, Massively Parallel
MariaDB Storage Engine
• Scalable, high-performance
analytics platform
• Built in redundancy and
high availability
• Runs on premise, on AWS cloud
• Full SQL syntax and capabilities
regardless of platform
Big Data Sources Analytics Insight
MariaDB ColumnStore
. . .
Node 1 Node 2 Node 3 Node N
Local / AWS® / GlusterFS
®
ELT
Tools
BI
Tools
8. Faster, More
Efficient Queries
Optimized for Columnar storage
• Columnar storage reduces disk I/O
• Blazing fast read-intensive workload
Parallel distributed query execution
• Distributed queries into series of parallel operations
• Fully parallel high speed data ingestion
Highly available analytic environment
• Built-in Redundancy
• Automatic fail-over
Parallel
Query Processing
9. Secure
MariaDB ColumnStore accesses all the same security capabilities delivered
in MariaDB Server. Secure ColumnStore data with encryption for data in
motion, role-based access and audit features.
Secure data in motion using out-of-the-box encryption
Tighten security with fine-grained role-based access
Easy auditability with a zero-downtime activation of query history
10. Performant
and scalable
MariaDB ColumnStore leverages the I/O benefits of columnar storage,
compression, just-in-time projection, and horizontal and vertical
partitioning to deliver tremendous performance when analyzing large data
sets.
Massive parallel processing for query intensive environments by distributed
scans, hash joins and aggregation
Support for fully parallel load capabilities
Faster time to insight with multi-threaded query execution and data
streaming at 500K writes per second
Runs as single or multi node system
Linear scalable by adding new nodes as data grows
no manual partitioning required
11. Costs Reducing
ColumnStore is a pluggable storage engine accessible from the same SQL
compliant interface as MariaDB Server. This radically simplifies data
management while reducing operating costs.
Save time writing queries by using MariaDB’s fully compatible SQL user
interface
Easily cross join tables from multiple storage engines for full insights
Support wide variety of BI tools with ANSI SQL support and language
drivers - any that uses ODBC/JDBC or MariaDB/MySQL connectors
12. Fast
MariaDB ColumnStore is a columnar storage engine for massively parallel
distributed query execution and data loading. It supports a vast spectrum of
analytic database use cases including real time, batch and algorithmic.
Increase performance with complex aggregation, joins and windowing
functions at the data storage level
Optimize for real-time and ad hoc database analytics
No indexes, no materialized views
Integrate with Spark for advanced analytics
13. Better Price
Performance
Flexible deployment option
• Cloud and On-premise
• Run on commodity hardware
• Open Source, Subscription based pricing
No need to maintain a third platform
• Run analytics from the same SQL front end
• No need to update application code
• Leverage MariaDB Extensible architecture
High data compression
• More efficient at storing big data
• Less hardware
90.3%
less per TB
per year
Commercial Data
Warehouse
MariaDB
ColumnStore
14. Technical Use Cases
Data Warehousing
Selective column
based queries
Large number
of dimensions
High Performance
Analytics On Large
Volume Of Data
Reporting and analysis
on millions or billions
of rows
From datasets
containing millions
to trillions of rows
Terabytes to Petabytes
of datasets
Analytics Require
Complex Joins,
Windowing Functions
16. MariaDB ColumnStore Architecture
• Massively parallel
architecture
– Linear scalability as
new nodes are added
• Horizontal scaling
– Add new data nodes
as your data grows
– Continue read queries
when adding new nodes
– Utilize MaxScale to load
balance and provide
single front end access
point.
Shared-Nothing Distributed Data Storage
Compressed by default (Local | SAN | EBS | Gluster FS )
User
Module
(UM)
Performance
Module
(PM)
Data Storage
Load
Balancer
(MaxScale)
17. User Modules Query Processing – UM
• Query is parsed by mysqld on UM node
• Parsed query handed over to ExeMgr on UM node
• ExecMgr breaks down the query in primitive operations
• SQL Operations are translated into thousands of
Primitives
• Parallel/Distributed SQL
• Extensible with Parallel/Distributed UDFs
• mysqld - The MariaDB
server
• ExeMgr - MariaDB’s
interface to ColumnStore
• cpimport - high-
performance data import
MariaDB SQL
Front End
User Module (UM)
18. Performance Module Query Processing – PM
• Primitives are processed on PM
• One thread working on a range of rows
• Typically 1/2 million rows, stored in a few hundred
blocks of data
• Execute all column operations required (restriction and
projection)
• Execute any group by/aggregation against local data
• Each primitive executes in a fraction of a second
• Primitives are run in parallel and fully distributed
• PrimProc - Primitives
Processor
• WriteEngineServ -
Database file writing
processor
• DMLProc - DML writes
processor
• DDLProc - DDL processors
Distributed
Query Engine
Performance
Module (PM)
19. Storage Architecture
• Columnar storage
– Each column stored as separate file
– No index management for query
performance tuning
– Online Schema changes: Add new column
without impacting running queries
• Automatic horizontal partitioning
– Logical partition every 8 Million rows
– In memory metadata of partition min and max
– No partition management for query
performance tuning
• Compression
– Accelerate decompression rate
– Reduce I/O for compressed blocks
Column 1
Extent 1 (8 million rows, 8MB 64MB)
Extent 2 (8 million rows)
Extent M (8 million rows)
Column 2 Column 3 ... Column N
Data automatically arranged by
• Column – Acts as Vertical Partitioning
• Extents – Acts as horizontal partition
Vertical
Partition
Horizontal
Partition
...
Vertical
Partition
Vertical
Partition
Vertical
Partition
Horizontal
Partition
Horizontal
Partition
20. High Performance Data Ingestion
• Fully parallel high
speed data load
– Parallel data loads on all PMs simultaneously
– Multiple tables in can be loaded simultaneously
– Read queries continue without being blocked
• Micro-batch loading
for real-time data flow
Column 1
Extent 1 (8 million rows, 8MB 64MB)
Extent 2 (8 million rows)
Extent M (8 million rows)
Column 2 ... Column N
Horizontal
Partition
...
Horizontal
Partition
Horizontal
Partition
High Water Mark
New Data being loaded
Dataaccessedby
runningqueries
22. Row oriented:
rows stored
sequentially in a file.
Column oriented:
Each column is stored
in a separate file. Each
column for a given
row is at the same
offset.
Row-oriented vs. Column-oriented format
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
Key
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
23. Row oriented:
new rows appended to
the end.
Column oriented:
new value added to
each file.
Single-Row Operations - Insert
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Key
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Columnar insert not efficient for singleton insertions (OLTP). Batch loads touches
row vs. column. Batch load on column-oriented is faster (compression, no indexes).
24. Row oriented:
new rows deleted
Column oriented:
value deleted from
each file
Single-Row Operations - Delete
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Key
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
6 Marvin Martian CA 91602 (818) 761-9964 26 M
Recommended Partition Drop to allow dropping columns in bulk.
25. Row oriented:
Update 100% of rows
means change 100%
of blocks on disk.
Column oriented:
Just update the blocks
needed to be updated
Single-Row Operations - Update
Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
Key
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
26. Row oriented:
requires rebuilding of
the whole table
Column oriented:
Create new file for the
new column
Changing the table structure
Key Fname Lname State Zip Phone Age Sex Active
1 Bugs Bunny NY 11217 (718) 938-3235 34 M Y
2 Yosemite Sam CA 95389 (209) 375-6572 52 M N
3 Daffy Duck NY 10013 (212) 227-1810 35 M N
4 Elmer Fudd ME 04578 (207) 882-7323 43 M Y
5 Witch Hazel MA 01970 (978) 744-0991 57 F N
Key
1
2
3
4
5
Fname
Bugs
Yosemite
Daffy
Elmer
Witch
Lname
Bunny
Sam
Duck
Fudd
Hazel
State
NY
CA
NY
ME
MA
Zip
11217
95389
10013
04578
01970
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
(978) 744-0991
Age
34
52
35
43
57
Sex
M
M
M
M
F
Active
Y
N
N
Y
N
Column-oriented is very flexible for adding columns, no need for a full rebuild
required with it.
27. Horizontal
Partition:
8 Million Rows
Extent 2
Horizontal
Partition:
8 Million Rows
Extent 3
Horizontal
Partition:
8 Million Rows
Extent 1
Storage Architecture reduces I/O SELECT Fname FROM Table 1 WHERE State = ‘NY’
• Only touch column files
that are in projection, filter
and join conditions
• Eliminate disk block touches
to partitions outside filter
and join conditions
Extent 1:
Min State: CA, Max State: NY
Extent 2:
Min State: OR, Max State: WY
Extent 3:
Min State: IA, Max State: TN
High Performance Query Processing
ID
1
2
3
4
...
8M
8M+1
...
16M
16M+1
...
24M
Fname
Bugs
Yosemite
Daffy
Hazel
...
...
Jane
...
Elmer
Lname
Bunny
Sam
Duck
Fudd
...
...
...
State
NY
CA
NY
ME
...
MN
WY
TX
OR
...
VA
TN
IA
NY
...
PA
Zip
11217
95389
10013
04578
...
...
...
Phone
(718) 938-3235
(209) 375-6572
(212) 227-1810
(207) 882-7323
...
...
...
Age
34
52
35
43
...
...
...
Sex
M
M
M
F
...
...
...
Vertical
Partition
Vertical
Partition
Vertical
Partition
Vertical
Partition
Vertical
Partition
…
ELIMINATED PARTITION
28. Analytics
Daily Running Average Revenue for each item
– In-database distributed analytics with complex
join, aggregation, window functions
– Extensible UDF for custom analytics
– Cross Engine Join with other storage engines
Window functions
– PARTITION BY / ORDER BY
– Aggregate functions: MAX, MIN, COUNT,
SUM, AVG STD, STDDEV_SAMP,
STDDEV_POP, VAR_SAMP, VAR_POP
– Ranking: ROW_NUMBER, RANK,
DENSE_RANK, PERCENT_RANK
CUME_DIST, NTILE, PERCENTILE,
PERCENTILE_CONT
PERCENTILE_DISC, MEDIAN
SELECT item_id, server_date, daily_revenue,
AVG(revenue) OVER
(PARTITION BY item_id ORDER BY server_date
RANGE INTERVAL '1' DAY PRECEDING ) running_avg
FROM web_item_sales
Item ID Server_date Revenue
1 02-01-2014 20,000.00
1 02-02-2014 5,001.00
2 02-01-2014 15,000.00
2 02-04-2014 34,029.00
2 02-05-2014 7,138.00
3 02-01-2014 17,250.00
3 02-03-2014 25,010.00
3 02-04-2014 21,034.00
3 02-05-2014 4,120.00
Running Average
20,000.00
12,500.50
15,000.00
34,209.00
20,583.50
17,250.00
250,100.00
12,577.00
20,583.50
30. Columnar General Best Practices
Not suited for OLTP
Micro-batch load allows for near real-time behavior
Infrequently used columns do not impact other queries
Columnar suitable for sparse columns (nulls compress nicely)
31. Data Modeling Best Practices
Star-schema optimizations are generally a good idea
Conservative data typing is very important
Especially around fixed-length vs. dictionary boundary (8 bytes)
IP Address vs. IP Number
Break down compound fields into individual fields:
Trivializes searching for sub-fields
Can avoid dictionary overhead
Cost to re-assemble is generally small
32. HA at UM node
• When one UM node goes down, another
UM node takes over
HA at PM node
• SAN/AWS EBS - When a PM node
goes down, the data volumes
attached to the failed PM node gets
attached to another PM
• Local Disks -If a PM node goes down,
the data on its disks are not available,
though queries continue on the
remaining data set
High Availability
HA at Data Storage
• AWS EBS (Elastic Block Store)
• GlusterFS - Multiple copy of data block
across storage. If a disk on a PM node
fails, another PM node will have access
to the copy of the data
35. Financial Services Industry
Industry Background
• Every customer interaction generates electronic records
• All transactions must be retained due to regulatory requirements
• Customer centric marketing became more important due to fierce competition
Why MariaDB ColumnStore
• Cost effective solution to archive all transactional data securely for regulatory compliance
• Fast data import from transactional database
• Easy to analyze the archived data with SQL based analytics
• Does not require DBA to index or partition data
36. Financial Services Industry
Use Cases
Regulatory Compliance
• Archive historic transactional data
• FINRA, Dodd Frank Act, SEC, SOX
Fraud Detection
• Fraudulent or anomaly trade detection among millions of transactions per day
• Proactively identify risks and prevent billions of loss due to fraud
Trade Analytics
• Analyze 20-30 million quotes per day
• Identify trade patterns and predict the outcome
37. Healthcare / Life Science Industry
Industry Background
• Electronic Medical Record (EMR) is increasing 48% annually
• Increase adoption of big data for advanced research projects
• Data protection and privacy regulations
Why MariaDB ColumnStore
• Strong security features like role based data access and audit plug in
• MPP architecture handles analytics on big data with high speed
• Easy to analyze the archived data with SQL based analytics
• Does not require DBA to index or partition data
38. Healthcare / Life Science Industry
Use Cases
Genome analysis
• In-depth genome research for the dairy industry to improve production of milk and protein.
• Fast data load for large amount of genome dataset (DNA data for 7billion cows in US - 20GB per load)
Healthcare spending analysis
• Analyze 3TB of US health care spending for 155 conditions with 7 years of historical data
• Used sankey diagram, treemap, and pyramid chart to analyze trends by age, sex, type of care, and condition
Viral disease analysis
• Used regional data with interactive map to identify Ebola disease spread
• The map displays not only the existing transmission of Ebola virus, but also the probability of occurrence
39. Telecommunication Industry
Industry Background
• Extremely high digital traffic and bandwidth
• Complex service offerings (4G, 5G, Wifi, IoT)
• Customer centric / personalized service is critical due to low switching cost
• High churn rate
Why MariaDB ColumnStore
• ColumnStore support time based partitioning and time-series analysis
• Fast data load for real-time analytics
• MPP architecture handles analytics on big data with high speed
• Easy to analyze the archived data with SQL based analytics
40. Telecommunication Industry
Use Cases
Customer behavior analysis
• Analyze call data record to segment customers based on their behavior
• Data-driven analysis for customer satisfaction
• Create behavioral based upsell or cross-sell opportunity
Network optimization
• Combine network performance data with internal data (CDR)
• Proactive services before the service is interrupted
Call data analysis
• Data size: 6TB
• Ingest 1.5 million rows of logs per day with 30million texts and 3million calls
• Call and network quality analysis
• Provide higher quality customer services based on data
41. High tech Industry
Industry Background
• High pressure to improve product quality and yield through various techniques (Six Sigma, JIT, Lean etc)
• Explosion of data due to monitoring and sensor device innovations through IoT
Why MariaDB ColumnStore
• Identify patterns from massive dataset to improve yield
• MPP architecture handles analytics on big data with high speed
• Easy to analyze the archived data with SQL based analytics
• Does not require DBA to index or partition data
42. High tech Industry
Use Cases
Yield analysis and optimization
• Run simulation to test the semiconductor quality
• Chip designers utilize this test to improve the chip design and improve yield
• 3,000 tests run in parallel that generate 5million to 30million data points
Sensor Analytics
• Import data from multiple IoT sensors
• Run time series analysis to predict patterns and detect anomalies
• Correlate multiple sensor informations to predict machine faliure
43. Industry Category Use Case
Gaming Behavior Analytics Projecting and predicting user behavior based on past and current data
Advertising Customer Analytics Customer behavior data for market segmentation and predictive analytics.
Advertising Loyalty Analytics Customer analytics focusing on a person’s commitment to a product, company, or brand.
Web, E-
commerce
Click Stream Analytics
Web activity analysis, software testing, market research with analytics on data about the clicks areas of web pages while
web browsing [Deal News]
Marketing Promotional Testing Using marketing and campaign management data to identify the best criteria to be used for a particular marketing offer.
Social Network Network Analytics Relationship analytics among network nodes
Financial Fraud Analytics
Monitoring user financial transactions and identifying patterns of behaviour to predict and detect abnormal or
fraudulent activity to prevent damage to user and institution.
Healthcare Patient Analytics Analyzing patient medical records to identify patterns to be used for improved medical treatment.
Healthcare Clinical Analytics Analyzing clinical data and its impact on patients to identify patterns to be used for improved medical treatment.
Telco
Network and Application
Performance Analytics
Streaming data from network devices and applications enriched with business operations data to uncover actionable
insights for network planning, operations and marketing analytics
Aviation Flight analytics
Proactively project parts replacement, maintenance and air-plane retirement based on real-time and historically
collected flight parameter data [Boeing]
Customer Use Cases
44. Where to find MariaDB ColumnStore?
SOFTWARE DOWNLOAD https://mariadb.com/downloads/columnstore
SOURCE https://github.com/mariadb-corporation/mariadb-columnstore-engine
DOCUMENTATION https://mariadb.com/kb/en/mariadb/mariadb-columnstore/
BLOGS https://mariadb.com/blog-tags/columnstore
</>