5. Diagnostic: Why did it happen?
● Aggregates: aggregate a measure over one or more dimensions
○ Find total sales
○ Top five products ranked by sales
● Roll-ups: aggregate at different levels of a dimension hierarchy
○ Given total sales by city, roll up to get sales by state
● Drill-downs: the inverse of roll-ups
○ Given total sales by state, drill down to get totals by city
● Slicing and dicing: equality and range selections on one or more dimensions (see the SQL sketch below)
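A minimal SQL sketch of these operations, assuming a hypothetical sales table with city, state, product, sale_date and amount columns:

-- Roll-up: from sales by city to sales by state
SELECT state, SUM(amount) AS total_sales
FROM sales
GROUP BY state;

-- Drill-down: back to sales by city within one state
SELECT city, SUM(amount) AS total_sales
FROM sales
WHERE state = 'NY'
GROUP BY city;

-- Slice and dice: equality and range selections, here also ranking the top five products
SELECT product, SUM(amount) AS total_sales
FROM sales
WHERE state = 'NY'
  AND sale_date BETWEEN '2016-01-01' AND '2016-03-31'
GROUP BY product
ORDER BY total_sales DESC
LIMIT 5;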
6. Predictive: What is likely to happen?
● Sales prediction
○ Analyze data to identify trends, spot weaknesses or determine conditions among broader data sets for making decisions about the future
● Targeted marketing
○ What is the likelihood of a customer buying a particular product based on past buying behavior?
8. Prescriptive: What is the best course of action?
The paradox of choice: with too many choices, which one is the best?
9. Data Analytics Use Cases, by industry
● Finance
○ Identify trade patterns
○ Detect fraud and anomalies
○ Predict trading outcomes
● Manufacturing
○ Simulations to improve design/yield
○ Detect production anomalies
○ Predict machine failures (sensor data)
● Telecom
○ Behavioral analysis of customer calls
○ Network analysis (performance and reliability)
● Healthcare
○ Find genetic profiles/matches
○ Analyze health vs. spending
○ Predict viral outbreaks
11. The OLTP or Transactional Workload
• OLTP applications
– have a read/write ratio of maybe 50/50
– Web apps / e-commerce have more reads, ending up at maybe 90/10
– Single rows are selected, inserted, updated and deleted, one by one or in small groups
• The OLTP data structure is a representation of the business or the application
– An order references a customer, and an order item is linked to an order
– Sometimes individual aspects break normal form, for performance reasons
• Transactions and ACID properties are required (see the transaction sketch below)
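A minimal sketch of such a transactional unit of work, assuming hypothetical orders, order_items and stock tables:

START TRANSACTION;
-- Create the order and its single line item
INSERT INTO orders (customer_id, order_date) VALUES (42, CURDATE());
INSERT INTO order_items (order_id, product_id, quantity)
VALUES (LAST_INSERT_ID(), 7, 1);
-- Keep inventory consistent within the same transaction
UPDATE stock SET on_hand = on_hand - 1 WHERE product_id = 7;
COMMIT;  -- ACID: either every statement takes effect or none do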
12. The OLAP or Analytics Workload
• Deals with data from a high-level perspective
• Contains structured, semi-structured and sometimes unstructured data
• Data structures are optimized for analytics use and performance
• Handles data in large groups of rows
– SELECTs data by date, customer location, product id, etc. (see the example query below)
– Dealing with individual data items is usually inefficient
• Analytics data
– Often comes from many different sources
– Is loaded in batch or streamed in, mainly just INSERTed
– Is sometimes purged, but most of the time not
• Queries are largely ad hoc
• Transactions and ACID requirements are relaxed
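An example of the kind of ad-hoc aggregate such a workload runs, assuming a hypothetical orders table with order_date, customer_state, product_id and quantity columns:

-- Scans millions of rows at once, filtered by date, grouped by location and product
SELECT customer_state, product_id, SUM(quantity) AS units_sold
FROM orders
WHERE order_date BETWEEN '2016-01-01' AND '2016-12-31'
GROUP BY customer_state, product_id;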
13. Analytics database requirements
• Fast access to large amounts of data
• Scalable as data grows over time
– Analytics requirements are increasing
– Regulatory requirements
– New data sources are added
• Load performance must be fast, scalable and predictable
• Data loading should be very flexible due to the different sources of data
– Some data is loaded in batch, other data is streamed
• Query performance also needs to be scalable
• Data compression is a requirement
– Due to data size constraints, as well as read performance from disk
14. B-Tree Indexes
THE GOOD
• Well-known technology
• Works with most types of data
• Scales reasonably well
• Really good for OLTP / transactional data
THE BAD
• Really bad for unbalanced data
• Index modifications can be really slow
• Index modifications are largely single-threaded
• Slows down with the amount of data
• Really not scalable with large amounts of data
15. In summary, what do we need?
• Something that
– can compress data A LOT
– can be written to with fast and predictable performance
– doesn't necessarily support transactions (performance is key)
– can support analytics queries (ad hoc, aggregates)
– can scale as data grows
– can still offer a level of high availability
• Something that works with analytics tools (Tableau, Pentaho, Microstrategy, etc.)
17. Existing Approaches
• Data warehouses
– Limited real-time analytics
– Slow releases of product innovation
– Expensive hardware and software
– Hard to use
– Purpose-built rather than predictive analytics
• Hadoop / NoSQL
– Limited SQL support
– Difficult to install and manage
– Limited talent pool
– Data lake with no data management
18. Row-oriented vs. Column-oriented format
• Row-oriented:
– Rows stored sequentially in a file
– Scans through every record, row by row
• Column-oriented:
– Each column is stored in a separate file
– Scans only the relevant columns
Row-oriented layout (all columns stored together, row by row):
ID Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F

Column-oriented layout (each column in its own file):
ID: 1, 2, 3, 4, 5
Fname: Bugs, Yosemite, Daffy, Elmer, Witch
Lname: Bunny, Sam, Duck, Fudd, Hazel
State: NY, CA, NY, ME, MA
Zip: 11217, 95389, 10013, 04578, 01970
Phone: (718) 938-3235, (209) 375-6572, (212) 227-1810, (207) 882-7323, (978) 744-0991
Age: 34, 52, 35, 43, 57
Sex: M, M, M, M, F
SELECT Fname FROM People WHERE State = 'NY'
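In the column-oriented layout this query reads only the State and Fname files; the other six column files are never touched.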
19. Single-Row Operations - Insert
Row-oriented: the new row is appended to the end of the file.
Column-oriented: the new value is appended to each column file.

Key Fname Lname State Zip Phone Age Sex
1 Bugs Bunny NY 11217 (718) 938-3235 34 M
2 Yosemite Sam CA 95389 (209) 375-6572 52 M
3 Daffy Duck NY 10013 (212) 227-1810 35 M
4 Elmer Fudd ME 04578 (207) 882-7323 43 M
5 Witch Hazel MA 01970 (978) 744-0991 57 F
6 Marvin Martian CA 91602 (818) 761-9964 26 M

In the column-oriented layout, 6 is appended to the Key file, Marvin to the Fname file, Martian to the Lname file, CA to the State file, and so on for Zip, Phone, Age and Sex.
Columnar inserts are not efficient for singleton insertions (OLTP). Batch loads are a different story: a batch load into a column-oriented store is faster, thanks to compression and the absence of indexes to maintain.
20. Single-Row Operations - Update
Row-oriented: updating 100% of the rows means changing 100% of the blocks on disk.
Column-oriented: only the blocks that actually need updating are changed.
(The same example People table as in section 18, shown in both the row-oriented and the column-oriented layouts.)
21. Single-Row Operations - Delete
Row-oriented: the row is deleted from the file.
Column-oriented: the value is deleted from each column file.
Partition drops are recommended for deleting data in bulk.
(The example table including row 6, Marvin Martian, shown in both layouts; the delete removes one row from the row file, or one value from each of the eight column files.)
22. Changing the table structure
Row-oriented: requires rebuilding the whole table.
Column-oriented: just create a new file for the new column.
Column-oriented storage is very flexible for adding columns; no full rebuild is required (see the ALTER TABLE sketch after the example).
Key Fname Lname State Zip Phone Age Sex Active
1 Bugs Bunny NY 11217 (718) 938-3235 34 M Y
2 Yosemite Sam CA 95389 (209) 375-6572 52 M N
3 Daffy Duck NY 10013 (212) 227-1810 35 M N
4 Elmer Fudd ME 04578 (207) 882-7323 43 M Y
5 Witch Hazel MA 01970 (978) 744-0991 57 F N

In the column-oriented layout, a new Active file (Y, N, N, Y, N) is simply created alongside the existing column files.
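A minimal sketch of such a schema change, assuming the example People table lives in a columnar engine like the one described in the following sections:

-- Creates a new column file; the existing column files are left untouched
ALTER TABLE People ADD COLUMN Active CHAR(1);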
23. Storage Architecture
• Columnar storage
– Each column is stored as a separate file
– No index management for query performance tuning
– Online schema changes: add a new column without impacting running queries
• Automatic horizontal partitioning
– Logical partition every 8 million rows
– In-memory metadata of each partition's min and max values
– The query engine performs partition elimination
– No partition management for query performance tuning
• Compression
– Tuned for a fast decompression rate
– Compressed blocks reduce I/O
Diagram: each column (Column 1 ... Column N) is stored as a series of extents of 8 million rows each (roughly 8MB-64MB). Columns act as vertical partitions, extents act as horizontal partitions, and the data is arranged this way automatically.
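A hedged sketch of how such a table might be declared with MariaDB ColumnStore (introduced later in this deck); no indexes or partition definitions are needed, since extents and their min/max metadata are managed automatically:

CREATE TABLE orders (
  id BIGINT,
  order_date DATE,
  customer_state CHAR(2),
  product_id INT,
  quantity INT,
  amount DECIMAL(10,2)
) ENGINE=ColumnStore;  -- no CREATE INDEX, no PARTITION BY clauses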
24. High Performance Query Processing
The storage architecture reduces I/O:
• Only the column files that appear in the filter, projection, GROUP BY and join conditions are touched
• Disk block touches are eliminated for partitions outside the filter and join conditions

Extent min/max metadata for the example Orders table:
Extent 1: ShipDate 2016-01-12 to 2016-03-05
Extent 2: ShipDate 2016-03-05 to 2016-09-23 (ELIMINATED PARTITION)
Extent 3: ShipDate 2016-09-24 to 2017-01-06 (ELIMINATED PARTITION)

SELECT Item, SUM(Quantity) FROM Orders
WHERE ShipDate BETWEEN '2016-01-01' AND '2016-01-31'
GROUP BY Item

Id OrderId Line Item Quantity Price Supplier ShipDate ShipMode
1 1 1 Laptop 5 1000 Dell 2016-01-12 G
2 1 2 Monitor 5 200 LG 2016-01-13 G
3 2 1 Mouse 1 20 Logitech 2016-02-05 M
4 3 1 Laptop 3 1600 Apple 2016-01-31 P
... (rows through 8M end at ShipDate 2016-03-05; rows 8M+1 to 16M run through 2016-09-23; rows 16M+1 to 24M run through 2017-01-06)

Only Extent 1 overlaps the January 2016 filter, so Extents 2 and 3 are never read.
27. MariaDB ColumnStore
A high performance columnar storage engine that supports a wide variety of analytical use cases in highly scalable distributed environments.
• Parallel query processing for distributed environments
• A single interface for OLTP and analytics
• The power of SQL and the freedom of open source for big data analytics
Benefits (expanded in the following sections): faster, more efficient queries; easy to manage and scale; easier enterprise analytics; better price performance.
28. MariaDB AX
Components: MariaDB Server, MariaDB MaxScale and MariaDB ColumnStore.
• Parallel queries
• Distributed storage
• No indexes
• Automatic partitioning
• Read optimized
• High compression, low disk I/O
Architecture diagram: MariaDB MaxScale in front of several MariaDB Server + ColumnStore nodes (UM, User Modules), which distribute work across ColumnStore storage nodes (PM, Performance Modules).
29. Better Price Performance
• No need to maintain a third platform
– Run analytics from the same SQL front end
– No need to update application code
– Leverage MariaDB's extensible architecture
• Flexible deployment options: cloud and on-premise
• Runs on commodity hardware
• Open source, subscription-based pricing
• High data compression
– More efficient at storing big data
– Less hardware
Customers have saved by moving to MariaDB AX from Oracle (healthcare), MemSQL (auto parts) and Vertica (finance, SEO marketing): come see them at M18!
Chart: MariaDB ColumnStore costs 90.3% less per TB per year than a commercial data warehouse.
30. Easier Enterprise Analytics
• Full ANSI SQL
– No more merely SQL-“like” query languages
– Supports complex joins, aggregation and window functions (see the sketch below)
• Single SQL front end
– Use a single SQL interface for analytics and OLTP
– Leverage MariaDB security features: encryption for data in motion, role-based access and auditing
• Easy to manage and scale
– Eliminates the need for indexes and views
– Automatic horizontal/vertical partitioning
– Linearly scalable by adding new nodes as data grows
– Out-of-the-box connections to BI tools
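A sketch of the kind of window-function query this enables, reusing the hypothetical sales table from the earlier OLAP examples:

-- Rank products by total sales within each state
SELECT state, product, SUM(amount) AS total_sales,
       RANK() OVER (PARTITION BY state ORDER BY SUM(amount) DESC) AS sales_rank
FROM sales
GROUP BY state, product;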
31. Faster, More Efficient Queries
Parallel distributed query execution
• Queries are distributed into a series of parallel operations
• Fully parallel, high-speed data ingestion (see the bulk-load sketch below)
– TPC-H lineitem table: 750K to 1 million rows per minute
Optimized for columnar storage
• Columnar storage reduces disk I/O
• Blazing fast for read-intensive workloads
• Ultra-fast data import
Highly available analytics environment
• Built-in redundancy
• Automatic failover
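A minimal sketch of a bulk load through the SQL interface, assuming a pipe-delimited TPC-H lineitem file (ColumnStore also ships a dedicated cpimport bulk loader):

-- Append a delimited file to the lineitem table in one batch
LOAD DATA LOCAL INFILE '/data/lineitem.tbl'
INTO TABLE lineitem
FIELDS TERMINATED BY '|';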
33. Healthcare / Life Science Industry
Genome analysis
• In-depth genome research for the dairy industry to improve production of milk and protein
• Fast data load for large amounts of genome data (DNA data for 7 billion cows in the US, 20GB per load)
Healthcare spending analysis
• Analyze 3TB of US healthcare spending for 155 conditions with 7 years of historical data
• Used Sankey diagrams, treemaps and pyramid charts to analyze trends by age, sex, type of care and condition
Why MariaDB ColumnStore
• Strong security features, including role-based data access and an audit plug-in
• MPP architecture handles analytics on big data at high speed
• Easy to analyze archived data with SQL-based analytics
• Does not require a DBA to index or partition data
34. Telecommunication Industry
Customer behavior analysis
• Analyze call data records to segment customers based on their behavior
• Data-driven analysis of customer satisfaction
• Create behavior-based upsell and cross-sell opportunities
Call data analysis
• Data size: 6TB
• Ingest 1.5 million rows of logs per day, with 30 million texts and 3 million calls
• Call and network quality analysis
• Provide higher quality customer service based on data
Why MariaDB ColumnStore
• ColumnStore supports time-based partitioning and time-series analysis
• Fast data load for real-time analytics
• MPP architecture handles analytics on big data at high speed
• Easy to analyze archived data with SQL-based analytics
35. In Conclusion
• Analytics requires a different technology to cope with
– Different types of data
– Different types of data access
• OLTP databases have different requirements than analytics databases
• Column-based storage allows high compression
• Metadata can replace indexing
• Distributed processing allows for performance and scalability
• MariaDB ColumnStore implements a fast and efficient distributed database for analytics
• MariaDB AX is the subscription for professional use of MariaDB ColumnStore
• MariaDB ColumnStore is gaining wide acceptance
38. IHME - Institute for Health Metrics and Evaluation
IHME visualizations library: http://www.healthdata.org/results/data-visualizations
Started with 4.2 TB, with the goal of growing to 30TB of data.
39. Customer Use Case - 1
Industry: healthcare (Medicaid)
Data: surveys
Use case: decision support system
Details:
• Identify trends and patterns
• Determine population cohorts
• Predict health outcomes
• Anticipate funding / capacity
• Recommend intervention
Couldn't run complex queries on the current hardware with Oracle and snowflake schemas
Limited to optimizing for simple, known queries (2-3 columns)
Replaced with ColumnStore:
> a single table
> 2.5 million rows, 248 columns
> complex, ad-hoc queries
> queries over 20+ columns return in seconds
40. Customer Use Case - 2
Industry: biotechnology (genetics)
Data: genotypes
Use case: genetic profiling
Details:
• Find genetic mates (beef and dairy)
• Predict meat production (pork)
• Gene/DNA analysis
Had to convert data to CSV files and schedule import jobs (cron)
Always receiving new genetic data
Migrated to a data adapter (Python):
> streamlined import process
> removed steps / possible errors
> removed delays
> data imported on demand
> immediate customer access
41. Customer Use Case - 3
Industry: mobile text/call app
Data: call and text logs
Use case: mobile app usage analytics
Details:
• 30 million texts and 3 million phone calls per day
• 1.5 billion rows of logs per day
• Text and call volume will continue to grow
The InnoDB backend hit its scale limit at 6TB and required a lot of performance tuning and index management
Migrated to MariaDB AX:
> able to process 24 months / 24TB vs. the 6-month limit with InnoDB
> the same BI tools and client applications worked seamlessly with MariaDB AX