[db tech showcase Tokyo 2017] C37: MariaDB ColumnStore analytics engine : use cases and new features coming in 1.1 by MariaDB Corporation - David Thompson
MariaDB ColumnStore is the analytics engine for MariaDB. This talk introduces the product and its use cases, and previews the new features coming in the next major release, 1.1.
1. MariaDB ColumnStore
Use Cases and Upcoming 1.1 Features
David Thompson
VP Engineering @ MariaDB
DB Tech Showcase Tokyo
September 7th 2017
2. What is MariaDB ColumnStore?
High-performance columnar storage engine that supports a wide variety of analytical use cases in highly scalable distributed environments
• Parallel query processing for distributed environments
• Faster, More Efficient Queries
• Single Interface for OLTP and analytics
• Easy to Manage and Scale
• Easier Enterprise Analytics
• Power of SQL and Freedom of Open Source to Big Data Analytics
• Better Price Performance
3. Better Price Performance
Flexible deployment options
• Cloud and on-premises
• Runs on commodity hardware
• Open source, subscription-based pricing
No need to maintain a third platform
• Run analytics from the same SQL front end
• No need to update application code
• Leverage MariaDB's extensible architecture
High data compression
• More efficient at storing big data
• Less hardware
[Chart: MariaDB ColumnStore costs 90.3% less per TB per year than a commercial data warehouse]
4. Easier Enterprise Analytics
ANSI SQL
Single SQL Front-end
• Use a single SQL interface for analytics and OLTP
• Leverage MariaDB security features: encryption for data in motion, role-based access, and auditability
Full ANSI SQL
• No more SQL-“like” queries
• Supports complex joins, aggregations, and window functions
Easy to manage and scale
• Eliminates the need for indexes and views
• Automated horizontal/vertical partitioning
• Scales linearly by adding new nodes as data grows
• Out-of-the-box connection with BI tools
5. Faster, More Efficient Queries
Optimized for columnar storage
• Columnar storage reduces disk I/O
• Blazing-fast read-intensive workloads
• Ultra-fast data import
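The disk I/O claim above can be illustrated with simple arithmetic. The sketch below is not ColumnStore code: the table, its column names, and the byte widths are hypothetical, and compression is ignored. It only compares how many bytes a scan must touch when a query references three columns out of seven.

```python
# Hypothetical table layout: column name -> bytes per value (assumed widths).
COLUMN_WIDTHS = {
    "trade_id": 8,
    "account_id": 8,
    "symbol": 8,
    "price": 8,
    "quantity": 4,
    "trade_ts": 8,
    "notes": 200,
}


def row_store_bytes(num_rows, widths):
    """A row-oriented scan reads whole rows, so every column is touched."""
    return num_rows * sum(widths.values())


def column_store_bytes(num_rows, widths, needed):
    """A column-oriented scan reads only the columns the query references."""
    return num_rows * sum(widths[name] for name in needed)


rows = 1_000_000
# Columns referenced by a query such as:
#   SELECT symbol, SUM(price * quantity) FROM trades GROUP BY symbol
needed = ["symbol", "price", "quantity"]

row_bytes = row_store_bytes(rows, COLUMN_WIDTHS)
col_bytes = column_store_bytes(rows, COLUMN_WIDTHS, needed)
print(f"row store scans    {row_bytes:,} bytes")
print(f"column store scans {col_bytes:,} bytes")
```

With these assumed widths the column-oriented scan touches 20 of every 244 bytes per row. The ratio depends entirely on how wide the table is and how few columns the query needs.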
Parallel distributed query execution
• Distributes queries into a series of parallel operations
• Fully parallel, high-speed data ingestion
Highly available analytic environment
• Built-in redundancy
• Automatic failover
[Figure: Parallel Query Processing]
6. MariaDB ColumnStore Architecture
• Massively parallel architecture
– Linear scalability as new nodes are added
• Horizontal scaling
– Add new data nodes as your data grows
– Continue read queries when adding new nodes
– Utilize MaxScale to load balance and provide a single front-end access point
[Diagram: MaxScale load balancer in front of the User Modules (UM), which sit above the Performance Modules (PM) and shared-nothing distributed data storage, compressed by default]
9. Financial Services Industry
Industry Background
• Every customer interaction generates electronic records
• All transactions must be retained due to regulatory requirements
• Customer-centric marketing has become more important due to fierce competition
Why MariaDB ColumnStore
- Cost-effective solution to archive all transactional data securely for regulatory compliance
- Fast data import from transactional databases
- Easy to analyze the archived data with SQL-based analytics
- Does not require a DBA to index or partition data
10. Financial Services Industry
Use Cases
Regulatory Compliance
• Archive and retain historic transactional data
Fraud Detection
• Detect fraudulent or anomalous trades among millions of transactions per day
• Proactively identify risks and prevent billions in losses due to fraud
Trade Analytics
• Analyze 20-30 million quotes per day
• Identify trade patterns and predict the outcome
11. Healthcare / Life Science Industry
Industry Background
• Electronic Medical Record (EMR) usage is increasing 48% annually
• Increased adoption of big data for advanced research projects
• Data protection and privacy regulations
Why MariaDB ColumnStore
- Strong security features, including role-based data access and an audit plugin
- MPP architecture handles analytics on big data at high speed
- Easy to analyze archived data with SQL-based analytics
- Does not require a DBA to index or partition data
12. Healthcare / Life Science Industry
Use Cases
Genome analysis
• In-depth genome research for the dairy industry to improve production of milk and protein
• Fast data loads for large genome datasets (DNA data for 7 billion cows in US, 20GB per load)
Healthcare spending analysis
• Analyze 3TB of US health care spending for 155 conditions with 7 years of historical data
• Used Sankey diagrams, treemaps, and pyramid charts to analyze trends by age, sex, type of care, and condition
Viral disease analysis
• Used regional data with an interactive map to track the spread of Ebola
• The map displays not only the existing transmission of the Ebola virus, but also the probability of occurrence
14. Telecommunication Industry
Industry Background
• Extremely high digital traffic and bandwidth
• Complex service offerings (4G, 5G, Wi-Fi, IoT)
• Customer-centric / personalized service is critical due to low switching costs
• High churn rate
Why MariaDB ColumnStore
- ColumnStore supports time-based partitioning and time-series analysis
- Fast data load for real-time analytics
- MPP architecture handles analytics on big data at high speed
- Easy to analyze the archived data with SQL-based analytics
15. Telecommunication Industry
Use Cases
Customer behavior analysis
• Analyze call data records (CDR) to segment customers based on their behavior
• Data-driven analysis for customer satisfaction
• Create behavior-based upsell or cross-sell opportunities
Network optimization
• Combine network performance data with internal data (CDR)
• Provide proactive service before the service is interrupted
Call data analysis
• Data size: 6TB
• Ingest 1.5 million rows of logs per day with 30 million texts and 3 million calls
• Call and network quality analysis
• Provide higher quality customer services based on data
16. High tech Industry
Industry Background
• High pressure to improve product quality and yield through various techniques (Six Sigma, JIT, Lean, etc.)
• Explosion of data due to monitoring and sensor device innovations through IoT
Why MariaDB ColumnStore
- Identify patterns from massive datasets to improve yield
- MPP architecture handles analytics on big data at high speed
- Easy to analyze the archived data with SQL-based analytics
- Does not require a DBA to index or partition data
17. High tech Industry
Use Cases
Yield analysis and optimization
• Run simulations to test semiconductor quality
• Chip designers use these tests to improve chip design and yield
• 3,000 tests run in parallel, generating 5 million to 30 million data points
Sensor Analytics
• Import data from multiple IoT sensors
• Run time-series analysis to predict patterns and detect anomalies
• Correlate information from multiple sensors to predict machine failure
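As a rough illustration of the anomaly-detection step in the sensor analytics use case, the sketch below flags readings that deviate strongly from a trailing window. The slides do not say which method was used; the rolling z-score, the window size, the threshold, and the sample readings are all assumptions chosen for the example.

```python
from statistics import mean, stdev


def find_anomalies(readings, window=5, threshold=3.0):
    """Return indexes of readings that lie more than `threshold` standard
    deviations from the mean of the preceding `window` readings."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        # Skip flat windows, where the z-score is undefined.
        if sigma > 0 and abs(readings[i] - mu) > threshold * sigma:
            anomalies.append(i)
    return anomalies


# Hypothetical temperature readings from one sensor, with one spike.
readings = [20.0, 20.1, 19.9, 20.2, 20.0, 20.1, 35.0, 20.0, 19.9]
anomalies = find_anomalies(readings)
print(anomalies)
```

In a ColumnStore deployment the same idea would more likely be expressed in SQL with window functions over the ingested sensor table. The Python version is only meant to show the logic.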
19. ColumnStore 1.1
• After five 1.0.x maintenance releases bringing improved stability, 1.1 brings some exciting new major features!
• Some new components will be under LGPL and BSL licensing. The core ColumnStore engine and MariaDB server are GPL licensed.
• Release timeline: Beta in mid-September 2017 (Q3), GA in Q4 2017
20. ColumnStore 1.1 Features
Data Engine: Columnar storage engine based on MariaDB Server 10.2
Streaming / API: Bulk import API to support programmatic and streaming writes
High Availability: Integrated GlusterFS support to provide storage HA for local disk
Analytics: User Defined Aggregate / Window Functions
Data Types: Text and Blob support
Ease of Use: Backup and Restore Tool
Performance: Improved query and memory handling
Security: Audit Plugin integration
Certifications: Tableau certification
21. Data Streaming: ColumnStore Data API
What:
• C++ API to directly write to PM nodes
• LGPL licensed
• Per-table writes
• Input data is passed as C++ data structures in API calls
• Can run remotely from UM and PM servers
Benefits:
● Real-time streaming directly into the distributed data store
● No need to move large CSV data files to UM/PM
● Enable non-CSV data sources for ColumnStore
● Run outside UM/PM. Build custom ETL applications …
[Diagram: data sources (e.g. syslog) feed a data streaming application, which uses the CS Data API library to write directly to the write engine on each PM node]
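The flow on this slide (an application buffers rows for one table and writes them straight to the PM nodes) can be sketched as below. This is a Python illustration only, not the C++ Data API: the names `PMNode`, `BulkInsert`, `write_row`, and `commit` are hypothetical, and the round-robin distribution of batches across PM nodes is an assumption made for the example.

```python
class PMNode:
    """Stand-in for a PM node's write engine (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.tables = {}

    def write_batch(self, table, rows):
        self.tables.setdefault(table, []).extend(rows)


class BulkInsert:
    """Per-table writer: buffers rows and ships them to PM nodes in batches."""

    def __init__(self, pm_nodes, table, batch_size=2):
        self.pm_nodes = pm_nodes
        self.table = table
        self.batch_size = batch_size
        self._buffer = []
        self._next_node = 0

    def write_row(self, *values):
        self._buffer.append(values)
        if len(self._buffer) >= self.batch_size:
            self._flush()

    def _flush(self):
        if not self._buffer:
            return
        # Assumed policy: hand each full batch to the next PM node in turn.
        node = self.pm_nodes[self._next_node % len(self.pm_nodes)]
        node.write_batch(self.table, self._buffer)
        self._buffer = []
        self._next_node += 1

    def commit(self):
        """Flush any remaining buffered rows."""
        self._flush()


# A non-CSV source: syslog-style lines parsed in the streaming application.
lines = [
    "host1 sshd login ok",
    "host2 cron job started",
    "host1 sshd login failed",
    "host3 kernel oom",
    "host2 cron job finished",
]
nodes = [PMNode("pm1"), PMNode("pm2"), PMNode("pm3")]
bulk = BulkInsert(nodes, "logs")
for line in lines:
    host, service, message = line.split(" ", 2)
    bulk.write_row(host, service, message)
bulk.commit()

rows_per_node = [len(n.tables.get("logs", [])) for n in nodes]
print(rows_per_node)
```

The point of the sketch is that no intermediate CSV file is produced: rows go from the parsed source into the writer and out to the storage nodes.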
22. ColumnStore Data Adapters 1.1
What:
• Pre-packaged data adapters written using the CS Data API
• Convert from a specific data source into MariaDB ColumnStore
• BSL licensed
Benefits:
● Out-of-the-box real-time data streaming into ColumnStore
● No need to move large CSV data files to UM/PM
● Enable non-CSV data sources for ColumnStore
● Run outside UM/PM. Build custom ETL applications
[Diagram: two adapters write to the write engine on each PM node through the CS Data API library. The MaxScale CDC Adapter reads from the MDB master via MaxScale and the MaxScale CDC API; the Avro Adapter reads through a Kafka consumer interface.]
23. User Defined Distributed Aggregates
What
• Enables creation of user-defined functions for aggregates and window functions. 1.0 supports only user-defined scalar functions.
• Implemented using a C++ SDK and allows map / reduce work breakdown between UM and PM nodes.
Benefits
• Enables custom optimized analytical functions. For example:
– Sum of squares (Σ x²)
– Median (distributed)
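The map / reduce breakdown for the two example aggregates can be sketched as follows. This is a Python illustration of the idea, not the C++ SDK: the function names are hypothetical, and the median shown is the simplest approach (each PM sorts its partition and the UM merges the sorted runs), chosen for clarity rather than efficiency.

```python
import heapq


def pm_sum_of_squares(partition):
    """Map step on a PM node: partial sum of squares for its rows."""
    return sum(x * x for x in partition)


def um_sum_of_squares(partials):
    """Reduce step on the UM: combine the per-PM partial sums."""
    return sum(partials)


def pm_sorted(partition):
    """Map step for the median: each PM returns its values sorted."""
    return sorted(partition)


def um_median(sorted_parts):
    """Reduce step on the UM: merge sorted runs and pick the middle value."""
    merged = list(heapq.merge(*sorted_parts))
    n = len(merged)
    if n == 0:
        raise ValueError("median of empty input")
    mid = n // 2
    if n % 2:
        return merged[mid]
    return (merged[mid - 1] + merged[mid]) / 2


# Hypothetical data spread across three PM nodes.
partitions = [[1, 4, 7], [2, 5], [3, 6, 8, 9]]

total = um_sum_of_squares(pm_sum_of_squares(p) for p in partitions)
median = um_median([pm_sorted(p) for p in partitions])
print(total, median)
```

Sum of squares decomposes cleanly because partial sums can simply be added. The median does not, which is why the reduce step here needs the values themselves rather than one number per node.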
24. Built-in Data Redundancy for Local Storage
What:
• Enables auto-configuration of GlusterFS as the storage filesystem for PM data.
• Guided option during install; allows specification of the data redundancy factor (2 or more) and automated layout of data brick locations.
• If a PM node fails, then another node with a copy of the data block takes over.
Benefits:
● Provides data HA for on-premises customers without network storage appliances (or for cloud providers with low-performing networked filesystems).
[Diagram: a UM above PM 1, PM 2, and PM 3 on GlusterFS; each PM holds one primary data block (1, 2, or 3) plus copies of the other two blocks]
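The effect of a redundancy factor and the takeover on node failure can be sketched as below. This is not how GlusterFS or the ColumnStore installer computes brick locations: the ring-style placement, the zero-based PM indexes, and the function names are assumptions used only to illustrate the behaviour described above.

```python
def layout(pm_count, redundancy):
    """Place one data block per PM, with copies on the following PMs.

    Returns {block_index: [pm_index, ...]}, primary location first.
    """
    if not 2 <= redundancy <= pm_count:
        raise ValueError("redundancy must be between 2 and the number of PMs")
    return {
        block: [(block + k) % pm_count for k in range(redundancy)]
        for block in range(pm_count)
    }


def serving_pm(block_layout, block, failed):
    """Return the first surviving PM that holds the block, or None."""
    for pm in block_layout[block]:
        if pm not in failed:
            return pm
    return None


bricks = layout(pm_count=3, redundancy=2)
print(bricks)

# PM 0 fails: its block is served by the PM holding the copy.
after_one_failure = {b: serving_pm(bricks, b, failed={0}) for b in bricks}
print(after_one_failure)

# With redundancy 2, losing both holders of a block makes it unavailable.
after_two_failures = serving_pm(bricks, 0, failed={0, 1})
print(after_two_failures)
```

A higher redundancy factor tolerates more simultaneous PM failures at the cost of more local disk per node.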
25. Where to find MariaDB ColumnStore?
SOFTWARE DOWNLOAD https://mariadb.com/downloads/mariadb-ax
SOURCE https://github.com/mariadb-corporation/mariadb-columnstore-engine
DOCUMENTATION https://mariadb.com/kb/en/mariadb/mariadb-columnstore/
BLOGS https://mariadb.com/blog-tags/columnstore