
M|18 Analytics in the Real World, Case Studies and Use Cases


  1. 1. Analytics in the Real World, Case Studies and Use Cases. Amy Krishnamohan, Director of Product Marketing; GOTO Satoru, Customer Solutions Engineer
  2. 2. Finance ● Identify trade patterns ● Detect fraud and anomalies ● Predict trading outcomes Manufacturing ● Simulations to improve design/yield ● Detect production anomalies ● Predict machine failures (sensor data) Telecom ● Behavioral analysis of customer calls ● Network analysis (perf and reliability) Healthcare ● Find genetic profiles/matches ● Analyze health vs spending ● Predict viral outbreaks CIM Inc. MariaDB AX Use Case
  3. 3. Life Science industry. Use case: genetic profiling. Industry: biotechnology (genetics). Data: genotypes. Goals: 1. Find genetic mates for cattle 2. Predict meat production 3. Gene/DNA analysis. Challenge: had to convert to CSV files and schedule import jobs (cron) while always receiving new genetic data. Migrated to a data adapter (Python): ● streamline import process ● remove steps / possible errors ● remove delays ● import data on demand ● immediate customer access
  4. 4. Healthcare industry. Use case: decision support system. Industry: healthcare (Medicaid). Data: surveys. Goals: 1. Identify trends and patterns 2. Determine population cohorts 3. Predict health outcomes 4. Anticipate funding / capacity 5. Recommend intervention. Challenge: couldn't run complex queries on current hardware with Oracle and snowflake schemas; limited to optimizing for simple, known queries (2-3 columns). Replaced with ColumnStore: ● a single table ● 2.5 million rows, 248 columns supporting complex, ad-hoc queries ● query 20+ columns in seconds
  5. 5. Advertisement industry. Use case: ad analytics. Industry: digital advertisement. Data: logs. Goals: 1. Import logs 2. Analyze customer behavior (a. website clicks b. keyword searches) 3. Optimize ad performance 4. Manage dynamic pricing based on KPIs. Challenge: needs real-time analytics to optimize advertisement. Replaced with ColumnStore: ● fast data ingestion ● optimized ad performance ● A/B testing ● ads targeted by geography and demographic ● automated monitoring that adjusts traffic based on real-time performance and manages dynamic pricing
  6. 6. High tech industry. Use case: asset management. Industry: high tech. Data: asset-tracking time series. Goals: 1. Collect asset tracking data 2. Analyze and monitor (a. contracts b. performance) 3. Proactive service. Challenge: needs to ingest text-type data and integrate with a BI tool. Replaced with ColumnStore: ● faster data ingestion ● time series analysis with window functions ● real-time asset monitoring with Tableau ● predictive asset maintenance
  7. 7. Manufacturing industry. Use case: predictive maintenance. Industry: manufacturing / automobile. Data: sensor data. Goals: 1. Receive sensor data from different parts 2. Real-time monitoring 3. Analyze historical data to uncover machine failure patterns 4. Predict machine failures 5. Schedule proactive maintenance. Challenge: needs real-time data ingestion and integration with Spark to run machine learning algorithms. Replaced with ColumnStore: ● faster data ingestion ● leverage Spark ML ● real-time monitoring ● reduced production downtime
  8. 8. Finance industry. Use case: trading analysis. Industry: finance. Data: trading records. Goals: 1. Collect asset tracking data 2. Analyze and monitor (a. contracts b. performance) 3. Proactive service. Challenge: needs a big data analytics solution to analyze over 25 million quote records and 100,000 trading records per day. Replaced with ColumnStore: ● archive large data sets to comply with regulations ● self-service analytics for the sales/marketing team ● time series analysis with window functions
  9. 9. Time Series Data Analysis with ColumnStore
  10. 10. Forex historical data
  11. 11. Free currency historical data from HistData.com • GBPUSD M1 (1 minute) historical data in 2016: http://www.histdata.com/download-free-forex-historical-data/?/ascii/1-minute-bar-quotes/gbpusd/2016 • download HISTDATA_COM_ASCII_GBPUSD_M1_2016.zip
  12. 12. Free GBPUSD historical data (2016) • 1st column: timestamp • need to convert the format in order to fit with the DATETIME data type
  13. 13. MariaDB ColumnStore Data Types • INT types - range is 2 less than the unsigned max / signed min (values reserved internally) • CHAR† - max 255 bytes • VARCHAR† - max 8000 bytes • DECIMAL - max 18 digits • DOUBLE/FLOAT • DATETIME - no sub-seconds, yyyy-mm-dd hh:mm:ss • DATE • BLOB/TEXT
  14. 14. Convert timestamp w/ Ruby script id = 0 while line = gets timestamp, open, high, low, close = line.split(";") year, month, day, hour, minute, second = timestamp.unpack("a4a2a2xa2a2a2") id += 1 print "#{id},#{year}-#{month}-#{day} #{hour}:#{minute}," puts [open, high, low, close].join(",") end
  15. 15. Converted CSV 1,2016-01-03 17:00,1.473350,1.473350,1.473290,1.473290 2,2016-01-03 17:01,1.473280,1.473360,1.473260,1.473350 3,2016-01-03 17:02,1.473350,1.473350,1.473290,1.473290 4,2016-01-03 17:03,1.473300,1.473330,1.473290,1.473320 5,2016-01-03 17:04,1.473320,1.473340,1.473320,1.473320 6,2016-01-03 17:05,1.473340,1.473370,1.473300,1.473320 7,2016-01-03 17:06,1.473320,1.473320,1.473310,1.473310 8,2016-01-03 17:07,1.473310,1.473310,1.473300,1.473310 9,2016-01-03 17:08,1.473310,1.474010,1.473300,1.474010 • DATETIME - no sub-seconds yyyy-mm-dd hh:mm:ss
  16. 16. Populate sample data
  17. 17. CREATE DATABASE/TABLE MariaDB [(none)]> create database forex; MariaDB [(none)]> use forex; MariaDB [forex]> CREATE TABLE gbpusd( id int, time datetime, open double, high double, low double, close double) engine=columnstore default character set=utf8;
  18. 18. DESC TABLE MariaDB [forex]> desc gbpusd; +-----------+----------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------+----------+------+-----+---------+-------+ | id | int(11) | YES | | NULL | | | time | datetime | YES | | NULL | | | open | double | YES | | NULL | | | high | double | YES | | NULL | | | low | double | YES | | NULL | | | close | double | YES | | NULL | | +-----------+----------+------+-----+---------+-------+
  19. 19. import CSV into ColumnStore using cpimport # cpimport -s ',' forex gbpusd gbpusd2016.csv Locale is : C Column delimiter : , Using table OID 3163 as the default JOB ID Input file(s) will be read from : /home/vagrant/histdata Job description file : /usr/local/mariadb/columnstore/data/bulk/tmpjob/3163_D20170624_T103843_S950145_Job_3163.xml Log file for this job: /usr/local/mariadb/columnstore/data/bulk/log/Job_3163.log 2017-06-24 10:38:43 (29756) INFO : successfully loaded job file /usr/local/mariadb/columnstore/data/bulk/tmpjob/3163_D20170624_T103843_S950145_Job_3163.xml 2017-06-24 10:38:43 (29756) INFO : Job file loaded, run time for this step : 0.0321331 seconds 2017-06-24 10:38:43 (29756) INFO : PreProcessing check starts 2017-06-24 10:38:43 (29756) INFO : input data file /home/vagrant/histdata/gbpusd2016.csv 2017-06-24 10:38:43 (29756) INFO : PreProcessing check completed 2017-06-24 10:38:43 (29756) INFO : preProcess completed, run time for this step : 0.0329528 seconds 2017-06-24 10:38:43 (29756) INFO : No of Read Threads Spawned = 1 2017-06-24 10:38:43 (29756) INFO : No of Parse Threads Spawned = 3 2017-06-24 10:38:45 (29756) INFO : For table forex.gbpusd: 372,480 rows processed and 372480 rows inserted. 2017-06-24 10:38:46 (29756) INFO : Bulk load completed, total run time : 2.11976 seconds DB table
  20. 20. if cpimport failed... # cpimport forex gbpusd gbpusd2016.csv Locale is : C Using table OID 3163 as the default JOB ID Input file(s) will be read from : /home/vagrant/histdata Job description file : /usr/local/mariadb/columnstore/data/bulk/tmpjob/3163_D20170624_T104034_S269473_Job_3163.xml Log file for this job: /usr/local/mariadb/columnstore/data/bulk/log/Job_3163.log 2017-06-24 10:40:34 (30209) INFO : successfully loaded job file /usr/local/mariadb/columnstore/data/bulk/tmpjob/3163_D20170624_T104034_S269473_Job_3163.xml 2017-06-24 10:40:34 (30209) INFO : Job file loaded, run time for this step : 0.0253589 seconds 2017-06-24 10:40:34 (30209) INFO : PreProcessing check starts 2017-06-24 10:40:34 (30209) INFO : input data file /home/vagrant/histdata/gbpusd2016.csv 2017-06-24 10:40:34 (30209) INFO : PreProcessing check completed 2017-06-24 10:40:34 (30209) INFO : preProcess completed, run time for this step : 0.065531 seconds 2017-06-24 10:40:34 (30209) INFO : No of Read Threads Spawned = 1 2017-06-24 10:40:34 (30209) INFO : No of Parse Threads Spawned = 3 2017-06-24 10:40:34 (30209) INFO : Number of rows with errors = 11. Row numbers with error reasons are listed in file /home/vagrant/histdata/gbpusd2016.csv.Job_3163_30209.err 2017-06-24 10:40:34 (30209) INFO : Number of rows with errors = 11. Exact error rows are listed in file /home/vagrant/histdata/gbpusd2016.csv.Job_3163_30209.bad 2017-06-24 10:40:34 (30209) ERR : Actual error row count(11) exceeds the max error rows(10) allowed for table forex.gbpusd [1451] 2017-06-24 10:40:34 (30209) CRIT : Bulkload Read (thread 0) Failed for Table forex.gbpusd. Terminating this job. [1451] 2017-06-24 10:40:34 (30209) INFO : Bulkload Parse (thread 2) Stopped parsing Tables. BulkLoad::parse() responding to job termination 2017-06-24 10:40:34 (30209) INFO : Bulkload Parse (thread 1) Stopped parsing Tables. 
BulkLoad::parse() responding to job termination 2017-06-24 10:40:34 (30209) INFO : Bulkload Parse (thread 0) Stopped parsing Tables. BulkLoad::parse() responding to job termination 2017-06-24 10:40:34 (30209) INFO : Table forex.gbpusd (OID-3163) was not successfully loaded. Rolling back. 2017-06-24 10:40:34 (30209) INFO : Bulk load completed, total run time : 0.638649 seconds
  21. 21. verify your Job_xxxx_xxxxx.err gbpusd2016.csv.Job_3163_30209.err : Line number 1; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 2; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 3; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 4; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 5; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 6; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 7; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 8; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 9; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 10; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1 Line number 11; Error: Data contains wrong number of columns; num fields expected-6; num fields found-1
  22. 22. Bulk import performance
  23. 23. Performance LOAD DATA LOCAL INFILE # mcsmysql --local-infile=1 forex Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 38 Server version: 10.1.23-MariaDB Columnstore 1.0.9-1 Copyright (c) 2000, 2017, Oracle, MariaDB Corporation Ab and others. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. MariaDB [forex]> LOAD DATA LOCAL INFILE 'gbpusd2016.csv' INTO TABLE gbpusd FIELDS TERMINATED BY ','; Query OK, 372480 rows affected (1.52 sec) Records: 372480 Deleted: 0 Skipped: 0 Warnings: 0
  24. 24. Performance ColumnStore : cpimport # cpimport -s ',' forex gbpusd gbpusd2016.csv 2017-06-24 10:38:45 (29756) INFO : For table forex.gbpusd: 372480 rows processed and 372480 rows inserted. 2017-06-24 10:38:46 (29756) INFO : Bulk load completed, total run time : 2.11976 seconds -s: field separator
  25. 25. cpimport • 2 sec. for 372,480 rows LOAD DATA LOCAL INFILE • 372480 rows affected (1.52 sec) CSV import: cpimport vs. LOAD DATA LOCAL INFILE
  26. 26. Performance ColumnStore : INSERT INTO INSERT INTO gbpusd_idb(id, time, open, high, low, close) VALUES('1', '2016-01-03 17:00', '1.473350', '1.473350', '1.473290', '1.473290'); INSERT INTO gbpusd_idb(id, time, open, high, low, close) VALUES('2', '2016-01-03 17:01', '1.473280', '1.473360', '1.473260', '1.473350'); INSERT INTO gbpusd_idb(id, time, open, high, low, close) VALUES('3', '2016-01-03 17:02', '1.473350', '1.473350', '1.473290', '1.473290'); ... MariaDB [forex]> source gbpusd2016.sql ... MariaDB [forex]> Bye real 18m16.178s user 0m28.330s sys 0m23.551s
  27. 27. Simple query w/ SQLPad
  28. 28. SQLPad - https://github.com/rickbergfalk/sqlpad
  29. 29. SQLPad installation / launch # yum -y install npm (EPEL repository required) # npm install sqlpad -g $ sqlpad --ip 0.0.0.0 --port 3000 Launching server WITHOUT SSL Welcome to SQLPad! Visit http://localhost:3000 to get started
  30. 30. UK votes to leave EU after dramatic night divides nation • https://www.theguardian.com/politics/2016/jun/24/br itain-votes-for-brexit-eu-referendum-david-cameron The value of the pound swung wildly on currency markets as initial confidence among investors expecting a remain vote was dented by some of the early referendum results, triggering falls of close to 10% and its biggest one-day fall ever.
  31. 31. simple query for time period before/after vote
  32. 32. simple query for time period before/after vote
  33. 33. Query Results Visualization
  34. 34. MariaDB ColumnStore Window Functions
  35. 35. Supported Window Functions Function Description AVG() The average of all input values. COUNT() Number of input rows. CUME_DIST() Calculates the cumulative distribution, or relative rank, of the current row to other rows in the same partition. Number of peer or preceding rows / number of rows in partition. DENSE_RANK() Ranks items in a group leaving no gaps in ranking sequence when there are ties. FIRST_VALUE() The value evaluated at the row that is the first row of the window frame (counting from 1); null if no such row.
  36. 36. Supported Window Functions (cont’d) Function Description LAG() The value evaluated at the row that is offset rows before the current row within the partition; if there is no such row, instead return default. Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to null. LAG provides access to more than one row of a table at the same time without a self-join. Given a series of rows returned from a query and a position of the cursor, LAG provides access to a row at a given physical offset prior to that position. LAST_VALUE() The value evaluated at the row that is the last row of the window frame (counting from 1); null if no such row. LEAD() Provides access to a row at a given physical offset beyond that position. Returns value evaluated at the row that is offset rows after the current row within the partition; if there is no such row, instead return default. Both offset and default are evaluated with respect to the current row. If omitted, offset defaults to 1 and default to null. MAX() Maximum value of expression across all input values.
  37. 37. Supported Window Functions (cont’d) Function Description MEDIAN() An inverse distribution function that assumes a continuous distribution model. It takes a numeric or datetime value and returns the middle value or an interpolated value that would be the middle value once the values are sorted. Nulls are ignored in the calculation. MIN() Minimum value of expression across all input values. NTH_VALUE() The value evaluated at the row that is the nth row of the window frame (counting from 1); null if no such row. NTILE() Divides an ordered data set into a number of buckets indicated by expr and assigns the appropriate bucket number to each row. The buckets are numbered 1 through expr. The expr value must resolve to a positive constant for each partition. Integer ranging from 1 to the argument value, dividing the partition as equally as possible. PERCENT_RANK() relative rank of the current row: (rank - 1) / (total rows - 1).
  38. 38. Supported Window Functions (cont’d) Function Description PERCENTILE_CONT() An inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation. PERCENTILE_DISC() An inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation. RANK() rank of the current row with gaps; same as row_number of its first peer. ROW_NUMBER() number of the current row within its partition, counting from 1 STDDEV() STDDEV_POP() Computes the population standard deviation and returns the square root of the population variance.
  39. 39. Supported Window Functions (cont’d) Function Description STDDEV_SAMP() Computes the cumulative sample standard deviation and returns the square root of the sample variance. SUM() Sum of expression across all input values. VARIANCE() VAR_POP() Population variance of the input values (square of the population standard deviation). VAR_SAMP() Sample variance of the input values (square of the sample standard deviation).
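The ranking semantics listed above can be illustrated with a small, self-contained sketch (plain Python over toy data, not ColumnStore code; the function name `ranks` is hypothetical): RANK() gives tied values the same rank and leaves gaps, DENSE_RANK() leaves no gaps, and PERCENT_RANK() is (rank - 1) / (total rows - 1).

```python
def ranks(values):
    """For each value return (value, RANK, DENSE_RANK, PERCENT_RANK)
    over the whole list, mirroring the window-function definitions."""
    ordered = sorted(values)
    distinct = sorted(set(values))
    n = len(ordered)
    out = []
    for v in values:
        rank = ordered.index(v) + 1        # position of first peer -> gaps after ties
        dense = distinct.index(v) + 1      # ties collapse -> no gaps
        pct = (rank - 1) / (n - 1) if n > 1 else 0.0
        out.append((v, rank, dense, pct))
    return out

for row in ranks([10, 20, 20, 30]):
    print(row)  # ranks: 1, 2, 2, 4; dense ranks: 1, 2, 2, 3
```

Note how the value 30 gets RANK 4 (a gap after the tied 20s) but DENSE_RANK 3.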
  40. 40. MariaDB ColumnStore Aggregate Functions
  41. 41. MAX GBPUSD 23rd - 25th June 2016
  42. 42. MIN GBPUSD 23rd - 25th June 2016
  43. 43. Drop-off rate GBPUSD 23rd - 25th June 2016: -13% drop-off in a few hours
  44. 44. Correlation GBPUSD - USDJPY
  45. 45. Correlation GBPUSD - USDJPY @ Brexit: scatter plot (normalized; axes GBPUSD*100-130 and USDJPY-110)
  46. 46. Correlation GBPUSD - USDJPY @ Brexit SELECT ( AVG( gbpusd.close * usdjpy.close ) - AVG( gbpusd.close ) * AVG( usdjpy.close ) ) / ( STDDEV(gbpusd.close) * STDDEV(usdjpy.close) ) AS correlation_coefficient_population FROM usdjpy INNER JOIN gbpusd ON gbpusd.time = usdjpy.time WHERE gbpusd.time BETWEEN TIMESTAMP ( '2016-06-22' ) AND TIMESTAMP ( '2016-06-26' ); Pearson correlation coefficient
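The query above writes out the population Pearson correlation coefficient in SQL: (E[XY] - E[X]E[Y]) / (stddev_pop(X) * stddev_pop(Y)). As a sanity check, the same formula can be sketched in plain Python (toy close prices, not the actual Forex rows):

```python
from math import sqrt

def pearson(xs, ys):
    """Population Pearson correlation, term-for-term the same as the SQL:
    (AVG(x*y) - AVG(x)*AVG(y)) / (STDDEV_POP(x) * STDDEV_POP(y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    mean_xy = sum(x * y for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n   # population variance
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    return (mean_xy - mean_x * mean_y) / (sqrt(var_x) * sqrt(var_y))

# Hypothetical close prices for two currency pairs moving together:
gbpusd = [1.47, 1.45, 1.40, 1.37, 1.36]
usdjpy = [104.5, 103.8, 101.2, 100.1, 99.9]
print(pearson(gbpusd, usdjpy))  # near 1.0: strongly correlated
```

A value near 1 means the two series move together, near -1 that they move in opposite directions, near 0 that they are uncorrelated.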
  47. 47. Correlation GBPUSD - USDJPY @ Brexit: scatter plot (normalized; axis GBPUSD*100-130). Correlation coefficient 94.4%: highly correlated
  48. 48. Correlation GBPUSD - USDJPY 2016 (Jan. - Dec.): scatter plot (normalized; axes GBPUSD*100-130 and USDJPY-110). Correlation coefficient 36%: low correlation
  49. 49. Performance - ColumnStore vs. InnoDB ColumnStore storage engine: +------------------------------------+ | correlation_coefficient_population | +------------------------------------+ | 0.9648375371071727 | +------------------------------------+ 1 row in set (0.43 sec) > 1000 times faster than InnoDB SELECT (AVG(gbpusd.close*usdjpy.close) - AVG(gbpusd.close)*AVG(usdjpy.close)) / (STDDEV(gbpusd.close) * STDDEV(usdjpy.close)) AS correlation_coefficient_population FROM gbpusd JOIN usdjpy ON gbpusd.time = usdjpy.time WHERE gbpusd.time BETWEEN TIMESTAMP('2016-06-23') AND TIMESTAMP('2016-06-25'); InnoDB storage engine: +------------------------------------+ | correlation_coefficient_population | +------------------------------------+ | 0.964837537107134 | +------------------------------------+ 1 row in set (8 min 11.21 sec)
  50. 50. Moving Average w/ Window Functions
  51. 51. Moving Average GBPUSD SELECT time, close, AVG(close) OVER ( ORDER BY time ASC ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING ) AS MA13, COUNT(close) OVER ( ORDER BY time ASC ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING ) AS row_count FROM gbpusd WHERE time BETWEEN TIMESTAMP('2016-06-23') AND TIMESTAMP('2016-06-25');
  52. 52. Moving Average GBPUSD AVG(close) OVER ( ORDER BY time ASC ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING ) AS MA13
      time             close   MA13    rows in frame  position
      6/23/2016 00:00  1.4797  1.4797   7             preceding 6
      6/23/2016 00:01  1.4798  1.4797   8             preceding 5
      6/23/2016 00:02  1.4798  1.4796   9             preceding 4
      6/23/2016 00:03  1.4797  1.4796  10             preceding 3
      6/23/2016 00:04  1.4796  1.4796  11             preceding 2
      6/23/2016 00:05  1.4796  1.4796  12             preceding 1
      6/23/2016 00:06  1.4796  1.4796  13             current row
      6/23/2016 00:07  1.4796  1.4796  13             following 1
      6/23/2016 00:08  1.4796  1.4796  13             following 2
      6/23/2016 00:09  1.4796  1.4796  13             following 3
      6/23/2016 00:10  1.4796  1.4797  13             following 4
      6/23/2016 00:11  1.4796  1.4797  13             following 5
      6/23/2016 00:12  1.4797  1.4797  13             following 6
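The frame arithmetic behind AVG(close) OVER (ROWS BETWEEN 6 PRECEDING AND 6 FOLLOWING) can be mimicked in plain Python (a sketch, not ColumnStore internals; `centered_moving_average` is a hypothetical name): near the edges of the partition the frame is truncated, so the first row averages only 7 values, exactly as the SQL does.

```python
def centered_moving_average(values, preceding=6, following=6):
    """For each row, average the frame of `preceding` rows before it,
    the row itself, and `following` rows after it, truncating at the
    edges just like a SQL ROWS BETWEEN ... PRECEDING AND ... FOLLOWING
    frame."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - preceding)
        hi = min(len(values), i + following + 1)
        frame = values[lo:hi]
        out.append(sum(frame) / len(frame))
    return out

# Toy close prices (hypothetical, shorter than the real M1 series):
closes = [1.4797, 1.4798, 1.4798, 1.4797, 1.4796, 1.4796, 1.4796]
print(centered_moving_average(closes))
```

With the default window this is the MA13 column: a full frame has 13 rows, and only rows at least 6 positions from either end of the partition get the full 13-row average.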
  53. 53. Raw data GBPUSD M1
  54. 54. Moving Average 13 GBPUSD
  55. 55. Summary • Free Forex time series history data analyzed with: – simple analytic queries (aggregate functions) w/ SQLPad – moving average using window functions
  56. 56. Thank you!
  57. 57. Appendix
  58. 58. MariaDB Partners w/ Global Visual Analytics Leader Tableau https://mariadb.com/about-us/newsroom/press-releases/fastest-growing-open-source-database -mariadb-partners-global Fastest Growing Open Source Database MariaDB Partners With Global Visual Analytics Leader Tableau Combination of ubiquitous database and visual analytics technologies accelerates delivery of business insights MENLO PARK, Calif. and HELSINKI – December 12, 2017 – MariaDB® Corporation, the company behind the fastest growing open source database, today announced Tableau Software, the global leader in visual analytics, has certified MariaDB integration with Tableau’s business intelligence (BI) and visual analytics platform. Bringing together the highly popular data management products and the renowned visualization technologies means businesses globally can confidently use these preferred solutions for reliable, fast, data-driven business decisions.
  59. 59. Analyzing Queries in ColumnStore
  60. 60. Analyzing Queries : select calGetStats(); https://mariadb.com/kb/en/library/analyzing-queries-in-columnstore/ MariaDB [forex]> select calGetStats(); Query Stats: MaxMemPct-1; NumTempFiles-0; TempFileSpace-0B; ApproxPhyI/O-0; CacheI/O-6298; BlocksTouched-6298; PartitionBlocksEliminated-0; MsgBytesIn-4MB; MsgBytesOut-11MB; Mode-Distributed
  61. 61. Analyzing Queries : select calGetStats(); • MaxMemPct - Peak memory utilization on the User Module, likely in support of a large (User Module) based hash join operation. • NumTempFiles - Report on any temporary files created in support of query operations larger than available memory, typically for unusual join operations where the smaller table join cardinality exceeds some configurable threshold. • TempFileSpace - Report on space used by temporary files created in support of query operations larger than available memory, typically for unusual join operations where the smaller table join cardinality exceeds some configurable threshold. • PhyI/O - Number of 8k blocks read from disk, SSD, or other persistent storage. • CacheI/O - Approximate number of 8k blocks processed in memory, adjusted down by the number of discrete PhyI/O calls required. • BlocksTouched - Approximate number of 8k blocks processed in memory. • PartitionBlocksEliminated - The number of block touches eliminated via the Extent Map elimination behavior. • MsgBytesIn, MsgBytesOut - Message size in MB sent between nodes in support of the query.
  62. 62. Analyzing Queries : calSetTrace(1); calGetTrace(); MariaDB [test]> calSetTrace(1); MariaDB [test]> select c_name, sum(o_totalprice) from customer, orders where o_custkey = c_custkey and c_custkey = 5 group by c_name; +--------------------+-------------------+ | c_name | sum(o_totalprice) | +--------------------+-------------------+ | Customer#000000005 | 684965.28 | +--------------------+-------------------+ 1 row in set, 1 warning (0.34 sec) MariaDB [test]> select calGetTrace(); Desc Mode Table TableOID ReferencedColumns PIO LIO PBE Elapsed Rows BPS PM customer 3024 (c_custkey,c_name) 0 43 36 0.006 1 BPS PM orders 3038 (o_custkey,o_totalprice) 0 766 0 0.032 3 HJS PM orders-customer 3038 - - - - ----- - TAS UM - - - - - - 0.021 1
  63. 63. Analyzing Queries : calSetTrace(1); calGetTrace(); Desc – Operation being executed. Possible values: ● BPS - Batch Primitive Step : scanning or projecting the column blocks. ● CES - Cross Engine Step : performing a cross-engine join. ● DSS - Dictionary Structure Step : a dictionary scan for a particular variable-length string value. ● HJS - Hash Join Step : performing a hash join between 2 tables. ● HVS - Having Step : performing the HAVING clause on the result set. ● SQS - Sub Query Step : performing a sub query. ● TAS - Tuple Aggregation Step : the process of receiving intermediate aggregation results at the UM from the PM nodes. ● TNS - Tuple Annexation Step : query result finishing, e.g. filling in constant columns, limit, order by and final distinct cases. ● TUS - Tuple Union Step : performing a SQL union of 2 sub queries. ● TCS - Tuple Constant Step : processing constant value columns. ● WFS - Window Function Step : performing a window function.
  64. 64. Analyzing Queries : calSetTrace(1); calGetTrace(); • Mode – Where the operation was performed: UM or PM • Table – Table for which columns may be scanned/projected. • TableOID – ObjectID for the table being scanned. • ReferencedOIDs – ObjectIDs for the columns required by the query. • PIO – Physical I/O (reads from storage) executed for the query. • LIO – Logical I/O executed for the query, also known as Blocks Touched. • PBE – Partition Blocks Eliminated identifies blocks eliminated by Extent Map min/max. • Elapsed – Elapsed time for a given step. • Rows – Intermediate rows returned.
  65. 65. ColumnStore Architecture
  66. 66. MariaDB ColumnStore Architecture Columnar Distributed Data Storage Local Storage | SAN | EBS | Gluster FS BI Tool SQL Client Custom Big Data App Application MariaDB SQL Front End Distributed Query Engine Data Storage User Module (UM) Performance Module (PM)
  67. 67. Row-oriented vs. Column-oriented format
      • Row oriented – rows stored sequentially in a file – scans through every record row by row
      • Column oriented – each column is stored in a separate file – scans only the relevant columns
      Row-oriented layout:
      ID Fname    Lname State Zip   Phone          Age Sex
      1  Bugs     Bunny NY    11217 (718) 938-3235 34  M
      2  Yosemite Sam   CA    95389 (209) 375-6572 52  M
      3  Daffy    Duck  NY    10013 (212) 227-1810 35  M
      4  Elmer    Fudd  ME    04578 (207) 882-7323 43  M
      5  Witch    Hazel MA    01970 (978) 744-0991 57  F
      Column-oriented layout (one file per column):
      ID:    1 2 3 4 5
      Fname: Bugs Yosemite Daffy Elmer Witch
      Lname: Bunny Sam Duck Fudd Hazel
      State: NY CA NY ME MA
      Zip:   11217 95389 10013 04578 01970
      Phone: (718) 938-3235 (209) 375-6572 (212) 227-1810 (207) 882-7323 (978) 744-0991
      Age:   34 52 35 43 57
      Sex:   M M M M F
      SELECT Fname FROM Table 1 WHERE State = 'NY'
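The columnar scan described above can be sketched in a few lines of Python (a toy illustration using the slide's sample table, not how ColumnStore is implemented; `select_fname_where_state` is a hypothetical name): the query reads the State column once and projects Fname, never touching the other columns.

```python
# Column-oriented layout: each column is its own array (file), so
# SELECT Fname FROM t WHERE State = 'NY' touches only two of the
# eight columns, regardless of table width.
table = {
    "Fname": ["Bugs", "Yosemite", "Daffy", "Elmer", "Witch"],
    "Lname": ["Bunny", "Sam", "Duck", "Fudd", "Hazel"],
    "State": ["NY", "CA", "NY", "ME", "MA"],
}

def select_fname_where_state(table, state):
    # Scan the State column once for matching row positions...
    matches = [i for i, s in enumerate(table["State"]) if s == state]
    # ...then project only the Fname column at those positions.
    return [table["Fname"][i] for i in matches]

print(select_fname_where_state(table, "NY"))  # ['Bugs', 'Daffy']
```

A row store would instead read every field of every row to answer the same query; the saving grows with the number of columns.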
  68. 68. High Performance Query Processing. Storage architecture reduces I/O: • only touch column files that are in projection, filter and join conditions • eliminate disk block touches to partitions outside filter and join conditions. Data is stored in horizontal partitions (extents) of 8 million rows each, vertically partitioned per column, with Extent Map min/max values per extent: Extent 1: Min State CA, Max State NY; Extent 2: Min State OR, Max State WY; Extent 3: Min State IA, Max State TN. For SELECT Fname FROM Table 1 WHERE State = 'NY', Extent 2 is eliminated because 'NY' falls outside its min/max range.
  69. 69. sizing ColumnStore
  70. 70. Sizing Minimum Spec UM 4 core, 32 G RAM PM 4 core, 16 G RAM Typical Server spec PM 8 core 64G RAM UM 8 core, 264G RAM Data Storage External Data Volumes • Maximum 2 data volume per IO channel per PM node server • up to 2TB on the disk per data volume ≈ Max 4 TB per PM node Local disk Up to 2TB on the disk per PM node server DETAILED SIZING GUIDE based on data size and workload
  71. 71. MariaDB ColumnStore Sizing - Example • 60TB uncompressed data = 6TB compressed data at 10x compression • 2 UMs - 8 core 512GB (based on workload) • 6TB compressed = 3 data volumes (at 2TB per volume), with 1 data volume per PM node - 3 PMs • Data growth - 2TB per month; data retention - 2 years - plan for 2TB x 24 = 48TB additional - 48TB = 4.8TB compressed ≈ 3 data volumes (at 2TB per volume), with 1 data volume per PM node - 3 additional PMs • Total: 6 PMs, 2 UMs
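The sizing arithmetic above can be written as a small helper (a sketch; `pm_nodes_needed` is a hypothetical name, and the 10x compression ratio, 2TB per data volume, and 1 volume per PM node are the assumptions taken from this example):

```python
import math

def pm_nodes_needed(raw_tb, compression=10, tb_per_volume=2, volumes_per_pm=1):
    """PM nodes needed to hold `raw_tb` TB of uncompressed data."""
    compressed_tb = raw_tb / compression
    volumes = math.ceil(compressed_tb / tb_per_volume)  # round volumes up
    return math.ceil(volumes / volumes_per_pm)

initial = pm_nodes_needed(60)      # 60 TB raw -> 6 TB compressed -> 3 PMs
growth = pm_nodes_needed(2 * 24)   # 2 TB/month for 2 years -> 48 TB raw -> 3 PMs
print(initial + growth)            # 6 PMs total, plus the 2 UMs
```

Rounding up at the volume level is what makes 4.8TB compressed still require 3 full 2TB volumes.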
  72. 72. using ColumnStore via SSL/TLS connection
  73. 73. SSL variables w/o SSL MariaDB [(none)]> SHOW VARIABLES LIKE '%ssl%'; +---------------------+---------------------------------+ | Variable_name | Value | +---------------------+---------------------------------+ | have_openssl | YES | | have_ssl | DISABLED | | ssl_ca | | | ssl_capath | | | ssl_cert | | | ssl_cipher | | | ssl_crl | | | ssl_crlpath | | | ssl_key | | | version_ssl_library | OpenSSL 1.0.1e-fips 11 Feb 2013 | +---------------------+---------------------------------+
  74. 74. /usr/local/mariadb/columnstore/mysql/my.cnf [client] ssl-ca = /etc/pki/tls/mariadb/certs/ca-cert.pem ssl-cert = /etc/pki/tls/mariadb/certs/client-cert.pem ssl-key = /etc/pki/tls/mariadb/private/client-key.pem [mysqld] ssl-ca = /etc/pki/tls/mariadb/certs/ca-cert.pem ssl-cert = /etc/pki/tls/mariadb/certs/server-cert.pem ssl-key = /etc/pki/tls/mariadb/private/server-key.pem
  75. 75. SSL variables : SSL enabled MariaDB [(none)]> SHOW VARIABLES LIKE '%ssl%'; +---------------------+---------------------------------------------+ | Variable_name | Value | +---------------------+---------------------------------------------+ | have_openssl | YES | | have_ssl | YES | | ssl_ca | /etc/pki/tls/mariadb/certs/ca-cert.pem | | ssl_capath | | | ssl_cert | /etc/pki/tls/mariadb/certs/server-cert.pem | | ssl_cipher | | | ssl_crl | | | ssl_crlpath | | | ssl_key | /etc/pki/tls/mariadb/private/server-key.pem | | version_ssl_library | OpenSSL 1.0.1e-fips 11 Feb 2013 | +---------------------+---------------------------------------------+
  76. 76. status : SSL enabled MariaDB [(none)]> status -------------- /usr/local/mariadb/columnstore/mysql/bin/mysql Ver 15.1 Distrib 10.1.23-MariaDB, for Linux (x86_64) using readline 5.1 Connection id: 5 Current database: Current user: root@localhost SSL: Cipher in use is DHE-RSA-AES256-GCM-SHA384 Current pager: stdout Using outfile: '' Using delimiter: ; Server: MariaDB Server version: 10.1.23-MariaDB Columnstore 1.0.9-1 Protocol version: 10 Connection: Localhost via UNIX socket Server characterset: latin1 Db characterset: latin1 Client characterset: utf8 Conn. characterset: utf8 UNIX socket: /usr/local/mariadb/columnstore/mysql/lib/mysql/mysql.sock
