MariaDB ColumnStore
Use Cases and Upcoming 1.1 Features
David Thompson
VP Engineering @ MariaDB
DB Tech Showcase Tokyo
September 7th 2017
What is MariaDB ColumnStore?
High performance columnar storage engine that supports a wide variety
of analytical use cases in highly scalable distributed environments
Parallel query
processing for distributed
environments
Faster, More
Efficient Queries
Single Interface for
OLTP and analytics
Easy to Manage and Scale
Easier Enterprise
Analytics
Power of SQL and
Freedom of Open Source
to Big Data Analytics
Better Price
Performance
Better Price
Performance
Flexible deployment options
• Cloud and on-premise
• Runs on commodity hardware
• Open source, subscription-based pricing
No need to maintain a third platform
• Run analytics from the same SQL front end
• No need to update application code
• Leverage MariaDB's extensible architecture
High data compression
• More efficient at storing big data
• Less hardware
[Chart: MariaDB ColumnStore vs. commercial data warehouse, 90.3% less per TB per year]
Easier Enterprise
Analytics
ANSI SQL
Single SQL Front-end
• Use a single SQL interface for analytics and OLTP
• Leverage MariaDB security features: encryption for data in motion, role-based access, and auditability
Full ANSI SQL
• No more “SQL-like” queries
• Supports complex joins, aggregations, and window functions
Easy to manage and scale
• Eliminates the need for indexes and views
• Automated horizontal/vertical partitioning
• Scales linearly by adding new nodes as data grows
• Out-of-the-box connectivity with BI tools
Faster, More
Efficient Queries
Optimized for columnar storage
• Columnar storage reduces disk I/O
• Blazing-fast read-intensive workloads
• Ultra-fast data import
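To make the disk I/O point concrete, here is a minimal sketch comparing how many bytes a scan touches in a row layout versus a column layout. The table, column widths, and row count are hypothetical and not taken from the slides; compression is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    width_bytes: int  # fixed width per value, uncompressed (assumption)


def bytes_scanned_row_store(columns, row_count):
    """A row layout reads whole rows, so every column is paid for."""
    return row_count * sum(c.width_bytes for c in columns)


def bytes_scanned_column_store(columns, row_count, needed):
    """A column layout reads only the columns the query references."""
    return row_count * sum(c.width_bytes for c in columns if c.name in needed)


# Hypothetical trade table used only for illustration.
TRADES = [
    Column("trade_id", 8),
    Column("account_id", 8),
    Column("symbol", 8),
    Column("quantity", 4),
    Column("price", 8),
    Column("executed_at", 8),
]
ROWS = 1_000_000

# A query such as "average price per symbol" touches two of six columns.
row_bytes = bytes_scanned_row_store(TRADES, ROWS)
col_bytes = bytes_scanned_column_store(TRADES, ROWS, {"symbol", "price"})
print(f"row layout:    {row_bytes:,} bytes")
print(f"column layout: {col_bytes:,} bytes")
```

The saving grows with the number of columns the query does not need, which is why wide analytical tables benefit most.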
Parallel distributed query execution
• Distributes queries into a series of parallel operations
• Fully parallel high speed data ingestion
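The idea of breaking one query into parallel operations can be sketched as follows: each partition computes a partial GROUP BY count, and the partials are merged into the final answer. This is an illustration in plain Python with threads standing in for nodes; the function names and sample data are assumptions, not ColumnStore internals.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partial_count(partition):
    """Per-partition step: count rows per grouping key."""
    return Counter(key for key, _payload in partition)


def merge_counts(partials):
    """Final step: combine the partial results into one answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_group_count(partitions):
    """Run the per-partition step concurrently, then merge."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_count, partitions))
    return merge_counts(partials)


# Hypothetical rows of (network_type, record_id), split across three partitions.
PARTITIONS = [
    [("4G", 1), ("5G", 2), ("4G", 3)],
    [("WiFi", 4), ("4G", 5)],
    [("5G", 6), ("5G", 7)],
]

result = parallel_group_count(PARTITIONS)
print(dict(result))
```

Because the partial counts are independent, adding partitions (and workers) does not change the merge logic.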
Highly available analytic environment
• Built-in Redundancy
• Automatic fail-over
Parallel
Query Processing
MariaDB ColumnStore Architecture
• Massively parallel architecture
– Linear scalability as new nodes are added
• Horizontal scaling
– Add new data nodes as your data grows
– Continue read queries when adding new nodes
– Utilize MaxScale to load balance and provide a single front-end access point
[Architecture diagram: MaxScale load balancer, User Modules (UM), Performance Modules (PM), and shared-nothing distributed data storage, compressed by default]
ColumnStore Use Cases
MariaDB ColumnStore Use Cases
• Financial Services
• Healthcare
• Telecommunications
• High Tech
Financial Services Industry
Industry Background
• Every customer interaction generates electronic records
• All transactions must be retained due to regulatory requirements
• Customer-centric marketing has become more important due to fierce competition
Why MariaDB ColumnStore
- Cost-effective solution to archive all transactional data securely for regulatory compliance
- Fast data import from the transactional database
- Easy to analyze the archived data with SQL-based analytics
- Does not require a DBA to index or partition data
Financial Services Industry
Use Cases
Regulatory Compliance
• Archive and retain historic transactional data
Fraud Detection
• Detect fraudulent or anomalous trades among millions of transactions per day
• Proactively identify risks and prevent billions in losses due to fraud
Trade Analytics
• Analyze 20-30 million quotes per day
• Identify trade patterns and predict the outcome
Healthcare / Life Science Industry
Industry Background
• Electronic Medical Record (EMR) usage is increasing 48% annually
• Increased adoption of big data for advanced research projects
• Data protection and privacy regulations
Why MariaDB ColumnStore
- Strong security features including role-based data access and an audit plugin
- MPP architecture handles analytics on big data with high speed
- Easy to analyze archived data with SQL based analytics
- Does not require DBA to index or partition data
Healthcare / Life Science Industry
Use Cases
Genome analysis
• In-depth genome research for the dairy industry to improve production of milk and protein.
• Fast data load for large genome datasets (DNA data for 7 billion cows in US, 20GB per load)
Healthcare spending analysis
• Analyze 3TB of US health care spending for 155 conditions with 7 years of historical data
• Uses Sankey diagrams, treemaps, and pyramid charts to analyze trends by age, sex, type of care, and condition
Viral disease analysis
• Uses regional data with an interactive map to track the spread of Ebola
• The map displays not only the existing transmission of the Ebola virus, but also the probability of occurrence
Visualization
IHME Visualizations library: http://www.healthdata.org/results/data-visualizations
Telecommunication Industry
Industry Background
• Extremely high digital traffic and bandwidth
• Complex service offerings (4G, 5G, Wi-Fi, IoT)
• Customer-centric / personalized service is critical due to low switching cost
• High churn rate
Why MariaDB ColumnStore
- ColumnStore supports time-based partitioning and time-series analysis
- Fast data load for real-time analytics
- MPP architecture handles analytics on big data with high speed
- Easy to analyze the archived data with SQL based analytics
Telecommunication Industry
Use Cases
Customer behavior analysis
• Analyze call data records (CDR) to segment customers based on their behavior
• Data-driven analysis for customer satisfaction
• Create behavior-based upsell or cross-sell opportunities
Network optimization
• Combine network performance data with internal data (CDR)
• Provide proactive service before the service is interrupted
Call data analysis
• Data size: 6TB
• Ingest 1.5 million rows of logs per day with 30 million texts and 3 million calls
• Call and network quality analysis
• Provide higher quality customer services based on data
High tech Industry
Industry Background
• High pressure to improve product quality and yield through various techniques (Six Sigma, JIT, Lean, etc.)
• Explosion of data due to monitoring and sensor device innovations through IoT
Why MariaDB ColumnStore
- Identify patterns from massive datasets to improve yield
- MPP architecture handles analytics on big data with high speed
- Easy to analyze the archived data with SQL based analytics
- Does not require DBA to index or partition data
High tech Industry
Use Cases
Yield analysis and optimization
• Run simulations to test semiconductor quality
• Chip designers use these tests to improve chip design and yield
• 3,000 tests run in parallel, generating 5 million to 30 million data points
Sensor Analytics
• Import data from multiple IoT sensors
• Run time series analysis to predict patterns and detect anomalies
• Correlate information from multiple sensors to predict machine failure
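As one possible illustration of anomaly detection on sensor time series, the sketch below flags readings that deviate strongly from a trailing window (a rolling z-score). The slides do not say which technique the customers use; the window size, threshold, and sample readings are all assumptions.

```python
from statistics import mean, pstdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Return indexes whose reading is more than `threshold` standard
    deviations away from the mean of the preceding `window` readings."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        sigma = pstdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to score against; skip.
            continue
        z_score = abs(readings[i] - mean(history)) / sigma
        if z_score > threshold:
            anomalies.append(i)
    return anomalies


# Hypothetical temperature readings with one obvious spike at index 6.
readings = [20.0, 20.1, 19.9, 20.0, 20.2, 20.1, 35.0, 20.0, 19.9]
print(find_anomalies(readings))
```

In practice the per-sensor statistics could be computed in SQL with window functions and only the flagged rows pulled into the application.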
ColumnStore 1.1
ColumnStore 1.1
• After five 1.0.x maintenance releases bringing improved stability, 1.1 brings some exciting new major features!
• Some new components will be under LGPL and BSL licensing. Core ColumnStore engine and MariaDB server are GPL licensed.
• Release Timeline: Beta in mid-September 2017 (Q3), GA in Q4 2017
ColumnStore 1.1 Features
Data Engine: Columnar Storage engine based on MariaDB Server 10.2
Streaming / API: Bulk import API to support programmatic and streaming writes
High Availability: Integrated GlusterFS support to provide storage HA for local disk
Analytics: User Defined Aggregate / Window Functions
Data Types: Text and Blob support
Ease of Use: Backup and Restore Tool
Performance: Improved query and memory handling
Security: Audit Plugin integration
Certifications: Tableau certification
Data Streaming: ColumnStore Data API
What:
• C++ API to directly write to PM nodes
• LGPL licensed
• Per table write
• Input data is passed as C++ data structures in API calls
• Can run remotely from UM and PM servers
Benefits:
● Real-time streaming directly into distributed data store
● No need to move large CSV data files to UM/PM
● Enable non-CSV data sources for ColumnStore
● Run outside UM/PM. Build custom ETL applications …
[Diagram: data sources (e.g. syslog) feed a data streaming application, which uses the CS Data API library to write to the write engine on each PM node]
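The write path described on this slide (a per-table writer that takes in-memory rows and sends them straight to PM nodes) can be sketched roughly as follows. This is not the ColumnStore Data API, which is a C++ library; the class names, method names, batch size, and round-robin distribution are all hypothetical and only illustrate the shape of such a client.

```python
from dataclasses import dataclass, field


@dataclass
class PMNode:
    """Stand-in for a PM node's write engine (hypothetical)."""
    name: str
    rows: list = field(default_factory=list)

    def write_batch(self, table, batch):
        self.rows.extend((table, row) for row in batch)


class BulkInsert:
    """Hypothetical per-table bulk writer that buffers rows in memory
    and ships each full batch to the next PM node in turn."""

    def __init__(self, table, nodes, batch_size=2):
        self.table = table
        self.nodes = nodes
        self.batch_size = batch_size
        self._pending = []
        self._batches_sent = 0

    def write_row(self, *values):
        self._pending.append(values)
        if len(self._pending) >= self.batch_size:
            self._flush()

    def _flush(self):
        if not self._pending:
            return
        node = self.nodes[self._batches_sent % len(self.nodes)]
        node.write_batch(self.table, self._pending)
        self._pending = []
        self._batches_sent += 1

    def commit(self):
        """Send any remaining rows and report how many batches went out."""
        self._flush()
        return self._batches_sent


nodes = [PMNode("pm1"), PMNode("pm2"), PMNode("pm3")]
bulk = BulkInsert("syslog_events", nodes)
for i in range(5):
    bulk.write_row(i, f"message {i}")
batches = bulk.commit()
print(batches, [len(n.rows) for n in nodes])
```

No CSV file is staged anywhere in this flow: rows go from application memory to the node-side writers, which is the benefit the slide highlights.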
ColumnStore Data Adapters 1.1
What:
• Pre-packaged data adapters written using the CS Data API
• Convert from a specific data source into MariaDB ColumnStore
• BSL licensed
Benefits:
● Out-of-the-box real-time data streaming into ColumnStore
● No need to move large CSV data files to UM/PM
● Enable non-CSV data sources for ColumnStore
● Run outside UM/PM. Build custom ETL applications
[Diagram: a MaxScale CDC Adapter (MaxScale CDC API plus CS Data API library, fed by MaxScale from an MDB master) and an Avro Adapter (Kafka consumer interface plus CS Data API library) write to the write engine on each PM node]
User Defined Distributed Aggregates
What
• Enables creation of user defined functions for aggregates and window functions. 1.0 supports only user defined scalar functions.
• Implemented using C++ SDK and allows map / reduce work breakdown between UM and PM nodes.
Benefits
• Enables custom optimized analytical functions. For example:
– Sum of Squares (Σ x²)
– Median (distributed)
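The map / reduce breakdown for a user defined aggregate can be sketched with the sum-of-squares example: each PM computes a partial result over its local values, and the UM merges the partials. The real SDK is C++; the class shape and method names below (`next_value`, `merge`, `evaluate`) are assumptions chosen for illustration.

```python
class SumOfSquares:
    """Hypothetical aggregate with a per-row step, a merge step, and a final value."""

    def __init__(self):
        self.total = 0.0

    def next_value(self, x):
        # "Map" side: called once per row on the node that holds the row.
        self.total += x * x

    def merge(self, other):
        # "Reduce" side: fold another node's partial state into this one.
        self.total += other.total

    def evaluate(self):
        return self.total


def run_distributed(aggregate_cls, pm_partitions):
    """Simulate PM-side partial aggregation followed by a UM-side merge."""
    partials = []
    for values in pm_partitions:
        agg = aggregate_cls()
        for value in values:
            agg.next_value(value)
        partials.append(agg)

    final = aggregate_cls()
    for partial in partials:
        final.merge(partial)
    return final.evaluate()


# Hypothetical column values spread across three PM nodes.
result = run_distributed(SumOfSquares, [[1, 2], [3], [4, 5]])
print(result)
```

Sum of squares merges by simple addition. A distributed median is harder because partial medians cannot be combined directly, so the merge state would need to carry more than a single number.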
Built-in Data Redundancy for Local Storage
What:
• Enables auto-configuration of GlusterFS as the storage filesystem for PM data.
• Guided option during install; allows specification of the data redundancy factor (2 or more) and automated layout of data brick locations.
• If a PM node fails, another node with a copy of the data block takes over.
Benefits:
● Provides data HA for on-premise customers without network storage appliances (or for cloud providers with low-performing networked filesystems).
[Diagram: Data Blocks 1, 2, and 3, each with two copies, stored across PM 1, PM 2, and PM 3 on GlusterFS, below the UM]
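The redundancy scheme on this slide can be sketched as a placement table plus a failover lookup. The ring-style placement below is an assumption made for illustration; it is not a description of how GlusterFS actually lays out bricks, and the function names are hypothetical.

```python
def place_replicas(pm_nodes, redundancy):
    """Give each node one primary block and put the extra copies on the
    following nodes in ring order (illustrative placement only)."""
    if not 2 <= redundancy <= len(pm_nodes):
        raise ValueError("redundancy must be between 2 and the node count")
    count = len(pm_nodes)
    return {
        f"block{i + 1}": [pm_nodes[(i + k) % count] for k in range(redundancy)]
        for i in range(count)
    }


def serving_node(placement, block, failed):
    """Pick the first node holding the block that has not failed."""
    for node in placement[block]:
        if node not in failed:
            return node
    raise RuntimeError(f"no surviving copy of {block}")


placement = place_replicas(["PM1", "PM2", "PM3"], redundancy=3)
print(placement)

# PM2 normally serves block2; after it fails, a node with a copy takes over.
print(serving_node(placement, "block2", failed=set()))
print(serving_node(placement, "block2", failed={"PM2"}))
```

With a redundancy factor of 3 on three nodes, every block survives the loss of any two nodes, at the price of storing the data three times.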
Where to find MariaDB ColumnStore?
SOFTWARE DOWNLOAD https://mariadb.com/downloads/mariadb-ax
SOURCE https://github.com/mariadb-corporation/mariadb-columnstore-engine
DOCUMENTATION https://mariadb.com/kb/en/mariadb/mariadb-columnstore/
BLOGS https://mariadb.com/blog-tags/columnstore
Thank you

[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use cases and new features coming in 1.1 by MariaDB Corporation - David Thompson
