How Database Convergence Impacts the Coming Decades of Data Management

How Database Convergence Impacts the Coming Decades of Data Management by Nikita Shamgunov, CEO and co-founder of MemSQL.

Presented at NYC Database Month in October 2017. NYC Database Month is the largest database meetup in New York, featuring talks from leaders in the technology space. You can learn more at http://www.databasemonth.com.

Published in: Technology

  1. How Database Convergence Impacts the Coming Decades of Data Management. Nikita Shamgunov, CEO and co-founder of MemSQL
  2. MemSQL at a Glance
     ● MISSION: Growth of digital business is impacting data architectures. We make every company a real-time enterprise.
     ● PRODUCT: Top-ranked operational data warehouse. MemSQL provides you the ability to learn and react in real time.
     ● ABOUT: Founders are former Facebook and SQL Server database engineers. $85m in funding from top-tier investors; enterprise customers.
  3. Converge Transactions and Analytics in a Relational Database
     ● New breed of applications
       ○ Analytics as part of a transaction
       ○ Analytics when the data is born
       ○ In-database AI/ML
     ● Scalable OLAP and OLTP in one system
       ○ Fewer systems to manage
       ○ Utility database consumption
       ○ Supports HTAP
  4. Traditional + Future Architecture [diagram; labels: IoT Data; Social Data; In-Memory Data Store (RAM); Transactions + Operational Analytics; Data Integration; DMSA; Traditional Reports and Analytics; Analytics, Historical Reporting and Data Discovery; HTAP Apps; Analytic Apps]
  5. The New Data Architecture without DMSA [diagram; labels: IoT Data; Social Data; In-Memory Data Store (RAM); Transactions + Operational Reports; Analytics, Historical Reporting and Data Discovery; HTAP Apps; Analytic Apps]
  6. The Enterprise Requires Performance
     ● FAST data loading
       ○ Stream data
       ○ Real-time loading
       ○ Full data access
     ● LOW query latency
       ○ Vectorized queries
       ○ Real-time dashboards
       ○ Live data access
     ● HIGH concurrency
       ○ Multi-threaded processing
       ○ Transactions and analytics
       ○ Scalable performance
  7. North Star: Build a New Category of Databases
     ● Focus on analytics and deliver a hybrid-cloud data warehouse
       ○ Hybrid-cloud
       ○ Scalable, with integration to data lakes
       ○ Real-time
       ○ Simplicity
     ● Converge transactions and analytics
       ○ Transaction support
       ○ Multi-cloud reliability
       ○ Application support
  8. Real-time and Query Performance
     ● Goals: eliminate batching and deliver instant results to the user or app
     ● Investments
       ○ Streaming ingest
         ■ Kafka
         ■ Kinesis
       ○ Transactional consistency
         ■ Ability to change data rapidly
         ■ Ability to scale analytics to millions of requests a second to enable self-service customer analytics
       ○ Query performance
         ■ Scale out
         ■ Vectorization
     ● Results
       ○ Dramatic query performance improvements for BI use cases
       ○ Adoption of PIPELINES is growing
  9. Simplicity
     ● Goals
       ○ No knobs where you don’t need them
       ○ Data warehousing workloads work out of the box
       ○ No hints for queries
       ○ No scaling limits
     ● Investments
       ○ Query optimization and query execution
     ● Timelines
       ○ Several releases in 2017
  10. Access Methods
     ▪ Columnstore
       • On disk with working set in memory
       • Super fast scans
       • Supports analytical and data warehousing workloads
       • One index
       • Petabyte scale
     ▪ Rowstore
       • Fully in memory
       • Submillisecond point updates
       • Multiple indexes
  11. Scale Out and Transactional
     ▪ Supports multi-statement transactions
     ▪ Supports MVCC
     ▪ Scalable on commodity hardware
     ▪ Data is hash partitioned and stored in two copies
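
The last bullet of the slide above (hash partitioning with two copies) can be pictured with a small Python sketch. This is a toy model, not MemSQL's placement algorithm: the partition count, the node names, and the rule that the second copy lives on the next node are all assumptions made for illustration.

```python
import zlib

NUM_PARTITIONS = 8  # assumed value; a real cluster chooses its own count
NODES = ["leaf-0", "leaf-1", "leaf-2", "leaf-3"]  # hypothetical node names


def partition_for(key):
    """Map a shard key to a partition using a stable hash (CRC32 here)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def copies_for(partition):
    """Return (primary, secondary) nodes so the two copies never share a node."""
    primary = partition % len(NODES)
    secondary = (primary + 1) % len(NODES)  # assumed placement rule
    return NODES[primary], NODES[secondary]


def route(key):
    """Return the partition for a key and the two nodes that hold it."""
    partition = partition_for(key)
    return partition, copies_for(partition)
```

Because the hash is stable, the same key always routes to the same partition, and losing one node still leaves one copy of every partition on another node.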
  12. Query Processing (MemSQL Confidential)
  13. Query Performance
     ▪ Group-By/Aggregate performance
       • Operations on encoded data
       • Single-instruction multiple data (SIMD)
     ▪ Filter pushdown to column store
     ▪ Preference for dictionary compression
  14. SIMD overview
     ▪ Intel AVX-2
     ▪ 256-bit registers
     ▪ Pack multiple values per register
     ▪ Special instructions for SIMD register operations
     ▪ Arithmetic, logic, load, store etc.
     ▪ Allows multiple operations in 1 instruction
     ▪ Example: [1 2 3 4] + [1 1 1 1] = [2 3 4 5]
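
The packed addition on the SIMD slide can be imitated in plain Python to show the bookkeeping. This is only a sketch: Python does not issue AVX2 instructions, the 32-bit lane width is an assumption (the slide gives only the 256-bit register size), and integer overflow is ignored.

```python
REGISTER_BITS = 256  # AVX2 register width, from the slide
LANE_BITS = 32       # assumed element width (32-bit integers)
LANES = REGISTER_BITS // LANE_BITS  # values that fit in one register


def simd_add(a, b):
    """Add two equal-length lists one register-sized chunk at a time.

    Returns the sums and the number of simulated vector add instructions.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    sums = []
    instructions = 0
    for start in range(0, len(a), LANES):
        chunk_a = a[start:start + LANES]
        chunk_b = b[start:start + LANES]
        # One simulated instruction handles every lane in the chunk.
        sums.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1
    return sums, instructions
```

With 32-bit lanes, 16 additions need two simulated instructions instead of 16 scalar ones, which is the point of the "multiple operations in 1 instruction" bullet.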
  15. Filter pushdown to dictionary
     ▪ Example:
       • FactClick(id, region_id, …)
       • select region_id, count(*) from FactClick where region_id like '%east%' group by region_id
       • region_id has only a few dozen values
       • It is dictionary-encoded
  16. Segment-level filter pushdown
     ▪ E.g. 6 regions
     ▪ 1M rows per segment
     ▪ WHERE region_id like '%east%'
     ▪ 6 string comparisons/segment
     ▪ Cache lookup table L: [true, true, false, false, false, false]
     ▪ Output only rows where L[dictionary_id] = true
     ▪ Dictionary:
         dictionary_id   Region
         0               Northeast
         1               Southeast
         2               North-central
         3               South-central
         4               Northwest
         5               Southwest
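
The lookup-table idea on the slide above can be sketched in Python. The function names are hypothetical and the code says nothing about how MemSQL stores segments; it only shows why the string predicate runs once per dictionary entry rather than once per row. The case-insensitive match is an assumption about the collation.

```python
REGIONS = [
    "Northeast", "Southeast", "North-central",
    "South-central", "Northwest", "Southwest",
]


def build_lookup(dictionary, predicate):
    """Evaluate the predicate once per dictionary entry (6 comparisons here)."""
    return [predicate(entry) for entry in dictionary]


def filter_segment(dictionary_ids, lookup):
    """Return the row positions whose dictionary id passes the cached lookup."""
    return [row for row, did in enumerate(dictionary_ids) if lookup[did]]


def like_east(value):
    """Stand-in for LIKE '%east%', assuming a case-insensitive collation."""
    return "east" in value.lower()


# A tiny encoded segment: each row stores only a dictionary id.
segment = [0, 4, 1, 5, 0, 2]
lookup = build_lookup(REGIONS, like_east)
matching_rows = filter_segment(segment, lookup)
```

For a segment of 1M rows the per-row work is a single list index, while the string matching cost stays fixed at the size of the dictionary.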
  17. Performance
     ▪ Improved Group-By/Aggregate, up to 80X
     ▪ Columnstore string filter pushdown
     ▪ Improved sort performance (by 2-3X in some cases)
     ▪ Unenforced uniqueness constraints with RELY option
     ▪ Query optimizer improvements
     ▪ Columnstore update
     ▪ Columnstore JSON
  18. Automatic Statistics
     ▪ Always-on cardinality statistics for every column
     ▪ For columnstore tables only
     ▪ On by default
     ▪ Results in better query plans, with less need for a DBA to run ANALYZE TABLE and tune queries
  19. Columnstore update performance
     ▪ Ability to update rows identified via columnstore sort key
     ▪ Uses in-memory index on row store segment
     ▪ [diagram; labels: Row Store Segment; Col Store Segment; Col Store Segment; …; Index on Sort Key; Seek]
  20. New Query Features
     ▪ Cross-database queries (joins, insert-select)
     ▪ UPDATE/DELETE with joins
     ▪ UPDATE with subselect in SET clause
     ▪ reference_table LEFT JOIN …; (select …) LEFT JOIN … now supported
     ▪ Window functions with complex frames
       • E.g. avg(a) over (order by b rows between 5 preceding and current row)
     ▪ New window functions
       • first_value, last_value, nth_value, percentile_cont, percentile_disc
     ▪ Unenforced unique constraint + RELY
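
The frame in the window-function example on the slide above (rows between 5 preceding and current row) can be worked through in Python. This is a sketch of the frame semantics only, with a hypothetical function name; it ignores NULL handling and treats ties in b by input order.

```python
def moving_avg(rows, preceding=5):
    """Compute avg(a) over (order by b rows between N preceding and current row).

    rows is a list of (a, b) pairs; the result is a list of (b, average) pairs
    in order of b.
    """
    ordered = sorted(rows, key=lambda r: r[1])  # order by b
    result = []
    for i, (_, b) in enumerate(ordered):
        # The frame is the current row plus up to `preceding` rows before it.
        frame = ordered[max(0, i - preceding): i + 1]
        result.append((b, sum(a for a, _ in frame) / len(frame)))
    return result
```

Early rows have fewer than six rows in their frame, so their averages cover only the rows seen so far.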
  21. Extensibility Features
     ▪ Major, release-defining feature set
     ▪ User-defined
       • Stored procedures (SPs)
       • Scalar-valued functions (UDFs)
       • Table-valued functions (TVFs)
       • Aggregate functions (UDAFs)
     ▪ Highlights
       • SQL-developer friendly, clean syntax (no @, $ etc.)
       • Compiled to machine code for speed
       • Array and record support
  22. Example UDF: normalize_string()
        select normalize_string("     Abc    XYZ  ");
        abc xyz
  23. Implementation of normalize_string()
        delimiter //
        create or replace function normalize_string(str varchar(255))
        returns varchar(255) as
        declare
          r varchar(255) = "";
          i int;
          previousChar char;
          nextChar char;
          s varchar(255);
        begin
          s = lower(trim(str));
          if length(s) = 0 then
            return s;
          end if;
          previousChar = substr(s, 1, 1);
          r = concat(r, previousChar);
          i = 2;
          while i <= length(s) loop
            nextChar = substr(s, i, 1);
            if not(previousChar = ' ' and nextChar = ' ') then
              r = concat(r, substr(s, i, 1));
            end if;
            previousChar = nextChar;
            i += 1;
          end loop;
          return r;
        end //
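
To check the logic of normalize_string() outside MemSQL, the same steps can be ported to Python. This is a line-by-line sketch of the UDF above, not part of the talk; it assumes trim() removes only leading and trailing spaces.

```python
def normalize_string(s):
    """Lowercase, trim spaces, and collapse runs of spaces to a single space."""
    s = s.strip(" ").lower()
    if len(s) == 0:
        return s
    previous_char = s[0]
    result = [previous_char]
    for next_char in s[1:]:
        # Skip a space that directly follows another space.
        if not (previous_char == " " and next_char == " "):
            result.append(next_char)
        previous_char = next_char
    return "".join(result)
```

Calling it with the input from the example slide gives the same "abc xyz" result shown there.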
  24. Example SP: move data more than 5 minutes old from t1 to t2
        create table t1(a int, ts datetime);
        create table t2(a int, ts datetime);
        …
        create or replace procedure myMove() as
        declare
          boundary datetime = date_add(now(), interval -5 minute);
        begin
          insert into t2 select * from t1 where ts < boundary;
          delete from t1 where ts < boundary;
        end;
  25. Example TVF
        create table t (i int);
        insert into t values (1),(2),(3),(4),(5);
        create function basic(l int) returns table as
          return select * from t limit l;

        memsql> select * from basic(0);
        Empty set (0.00 sec)

        memsql> select * from basic(2);
        +------+
        | i    |
        +------+
        |    3 |
        |    2 |
        +------+
        2 rows in set (0.01 sec)
  26. User-Defined Aggregate Functions (UDAFs)
     ▪ Used like built-in aggregates such as SUM()
     ▪ Based on 4 user-defined functions
       • Initialize
       • Iterate
       • Merge
       • Terminate
  27. Example UDAF
        -- pick any arbitrary value from input
        delimiter //
        create function any_init() returns int as
        begin
          return -1;
        end;//
        create function any_iter(s int, v int) returns int as
        begin
          return v;
        end;//
        create function any_merge(s1 int, s2 int) returns int as
        begin
          if s1 = -1 then
            return s2;
          else
            return s1;
          end if;
        end;//
        create function any_terminate(s int) returns int as
        begin
          return s;
        end;//
        delimiter ;
        create aggregate any_val(int) returns int
          with state int
          initialize with any_init
          iterate with any_iter
          merge with any_merge
          terminate with any_terminate;
  28. UDAF Output
        create table t(g int, x int);
        insert into t values (100, 10), (100, 12), (100, 14), (200, 21), (200, 27);

        memsql> select g, any_val(x) from t group by g;
        +------+------------+
        | g    | any_val(x) |
        +------+------------+
        |  100 |         10 |
        |  200 |         27 |
        +------+------------+
        2 rows in set (0.00 sec)
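
How the four UDAF functions fit together can be sketched in Python. The functions mirror any_init, any_iter, any_merge and any_terminate from the example above; the driver run_aggregate and the idea that each inner list is the input seen by one node are assumptions for illustration, not a description of MemSQL's executor.

```python
def any_init():
    return -1  # sentinel state meaning "no value seen yet"


def any_iter(state, value):
    return value  # keep the most recent value


def any_merge(s1, s2):
    return s2 if s1 == -1 else s1  # prefer the first non-sentinel state


def any_terminate(state):
    return state


def run_aggregate(partitions):
    """Aggregate each partition separately, then merge the partial states."""
    partials = []
    for values in partitions:
        state = any_init()
        for value in values:
            state = any_iter(state, value)
        partials.append(state)
    merged = any_init()
    for partial in partials:
        merged = any_merge(merged, partial)
    return any_terminate(merged)
```

Which input value comes back depends on how rows are split across partitions, which matches the "pick any arbitrary value" intent of the aggregate.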
  29. SCALAR (get a scalar query result)
        create table t (i int);
        insert into t values (1), (2), (3), (4), (5);
        create or replace procedure scalar_basic() as
        declare
          v query(i int) = select max(i) from t;
          s int = scalar(v);
        begin
          call tracelog(s);
        end;
  30. COLLECT
        create or replace procedure p_coll() as
        declare
          c array(record(v varchar(80)));
          t query(v varchar(80)) = select v from r order by v;
        begin
          delete from proc_log;
          c = collect(t);
          for x in c loop
            call tracelog(x.v);
          end loop;
        end;
  31. CALL and ECHO
     ▪ call sp_name(args)
       • When there is no need to output a rowset
     ▪ echo sp_name(args)
       • Outputs rowset to client
     ▪ Exception handling supported
  32. Performance
     ▪ Compiled to machine code using LLVM
     ▪ UDFs are inlined when appropriate
  33. Distributed Execution
     ▪ SPs run on the aggregator
     ▪ In queries issued from SPs, parameters and variables are substituted as strings on the aggregator before execution on the leaves
     ▪ UDFs can run on any node
       • Aggregators or leaves
       • Multiple invocations can run in parallel within a query
  34. Summary
     ▪ New, user-defined
       • Scalar functions
       • Stored procedures
       • Table-valued functions
       • Aggregate functions
     ▪ Friendly to experienced SQL developers
     ▪ Array and record types supported
     ▪ High performance through compilation to machine code
  35. Thank you. memsql.com
