Spring One 2012 Presentation – Effective design patterns with NewSQL

NewSQL is a term that describes the next generation of highly distributed, scalable, memory-oriented SQL databases. In this session, we will explore some basic concepts in NewSQL (VMware SQLFire), translate a traditional “Star” schema to a partitioned schema (scale-out design), walk through various SQL usage patterns – simple queries, complex joins, aggregations, stored procedures – and explain how they can be realized more effectively in SQLFire through replicated tables, partitioned tables, in-memory or disk-resident tables, parallel procedure execution, distributed transactions, etc.

We will also compare and contrast various NewSQL features with traditional SQL.


  • Relational databases are not predictable or reliable in terms of consistent performance, for a number of reasons. Firstly, every query uses a different amount of resources. A query could consume one or two I/Os, or one or two million, depending on how the query is written, what data is selected, and factors such as how the database is indexed. Performance is further varied by how the database is maintained (fragmentation). What makes matters more complex is that different predicate values for a query can map to vastly different data distributions, so the same query executed with different constants can have vastly different resource requirements. Because every query has a different “footprint”, running a query in isolation does not provide indicative statistics on how that query will perform under concurrent load. In fact, it becomes impossible to predict the exact execution duration of a relational database query, because its performance depends on what else is executing at the same moment. Cost-based optimizers can also change plans dynamically, resulting, again, in a variance in execution times. Essentially, a great deal of engineering goes into reducing disk I/O, which is the serious bottleneck. (A small JDBC sketch of this variance follows these notes.)
  • There are a lot of different ways to partition data in SQLFire. By default, SQLFire will try to distribute data evenly, at random, across all servers. If that's not good enough, you can exert a lot of control over how data is divided and distributed using list, range or expression based partitioning.
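The variance described in the first note is easy to demonstrate from application code. Below is a minimal JDBC sketch that times the same prepared statement with different constants; it assumes the FLIGHTS table from the demo schema later in the deck, and any JDBC-accessible database. A rare airport code may be satisfied by a few index reads while a hub airport touches millions of rows, so isolated timings say little about behavior under concurrent load.

    // Sketch: the same statement, different constants, very different cost.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class QueryVariance {
        // Returns elapsed milliseconds for one execution of the query.
        static long timeQueryMillis(Connection conn, String airport) throws Exception {
            long start = System.nanoTime();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT COUNT(*) FROM FLIGHTS WHERE ORIG_AIRPORT = ?")) {
                ps.setString(1, airport);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                }
            }
            return (System.nanoTime() - start) / 1_000_000;
        }
        // Compare timeQueryMillis(conn, "XXX") against a hub like "ORD":
        // same SQL text, wildly different resource footprints.
    }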

Spring One 2012 Presentation – Effective design patterns with NewSQL Presentation Transcript

  • 1. Effective design patterns with NewSQL
Jags Ramnarayan, Chief Architect, GemFire/SQLFire, vFabric
Guillermo Tantachuco, Regional Sr. Systems Engineer, vFabric
© 2012 SpringOne 2GX. All rights reserved. Do not distribute without permission.
  • 2. We challenge the traditional RDBMS design, NOT SQL
[Diagram: buffers tuned primarily for I/O – first write to the log, second write to data files]
• Too much I/O
• Design roots don't necessarily apply today
• Too much focus on ACID
• Disk synchronization bottlenecks
  • 3. Achieving consistent response times is challenging
– Resources (memory, I/O) consumed can vary a lot
– A highly selective query using an index can be very fast one moment (a high cache hit rate most of the time)
– But complex concurrent queries may wipe out the buffers, causing a huge spike in I/O the next moment
http://blog.tonybain.com/tony_bain/2009/05/the-problem-with-the-relational-database-part-2-predictability.html
  • 4. Common themes in next-gen DB architectures
– “Shared nothing” commodity clusters; focus shifts to memory, distributing data and clustering
– Scale by partitioning the data and moving behavior to the data nodes
– HA within the cluster and across data centers
– Add capacity to scale dynamically
(NoSQL, Data Grids, Data Fabrics, NewSQL)
  • 5. But, what about sharding?
• Sharding works but can be a huge burden over time
• Querying across partitions
– A simple nested loop join can be very expensive
– Aggregations, ordering, groupings have to be hand coded
– Managing large intermediate data sets becomes an app problem
• Transactions
– Cross-partition transactions are not possible
– Loss of atomicity/isolation means compensatory code needs to be built
• Management, elasticity
– Cannot expand cluster size on demand
– Management in general is difficult
  • 6. NewSQL concepts with VMware SQLFire
• Main-memory-oriented, clustered SQL DB
• NoSQL characteristics of scalability, performance and availability, but retains support for distributed transactions and SQL querying
• It is also designed so you can use it as an operational layer in front of your legacy databases through a caching framework
  • 7. SQLFire at a glance
– Tables can be replicated or partitioned; replication within the cluster is synchronous
– Expand the cluster on demand
– Shared-nothing “append only” disk persistence
– Caching framework: write-through, write-behind to an RDBMS
  • 8. Partitioning & Replication
  • 9. Explore features using a simple STAR schema
FLIGHTS
---------------------------------------------
FLIGHT_ID CHAR(6) NOT NULL,
SEGMENT_NUMBER INTEGER NOT NULL,
ORIG_AIRPORT CHAR(3),
DEPART_TIME TIME,
…
PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER)

FLIGHTAVAILABILITY (1–M with FLIGHTS)
---------------------------------------------
FLIGHT_ID CHAR(6) NOT NULL,
SEGMENT_NUMBER INTEGER NOT NULL,
FLIGHT_DATE DATE NOT NULL,
ECONOMY_SEATS_TAKEN INTEGER,
…
PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER, FLIGHT_DATE),
FOREIGN KEY (FLIGHT_ID, SEGMENT_NUMBER) REFERENCES FLIGHTS (FLIGHT_ID, SEGMENT_NUMBER)

FLIGHTHISTORY (1–1 with FLIGHTS)
---------------------------------------------
FLIGHT_ID CHAR(6),
SEGMENT_NUMBER INTEGER,
ORIG_AIRPORT CHAR(3),
DEPART_TIME TIME,
DEST_AIRPORT CHAR(3),
…

SEVERAL CODE/DIMENSION TABLES
---------------------------------------------
AIRLINES: airline information (very static)
COUNTRIES: list of countries served by flights
CITIES: …
MAPS: photos of regions served

Assume thousands of FLIGHTS rows and millions of FLIGHTAVAILABILITY records.
  • 10. Creating tables
CREATE TABLE AIRLINES (
  AIRLINE CHAR(2) NOT NULL PRIMARY KEY,
  AIRLINE_FULL VARCHAR(24),
  BASIC_RATE DOUBLE PRECISION,
  DISTANCE_DISCOUNT DOUBLE PRECISION, …);
[Diagram: the table hosted in a cluster of three SQLF servers]
  • 11. Replicated tables
Design pattern: replicate reference tables in STAR schemas (seldom change, often referenced in queries).
CREATE TABLE AIRLINES (
  AIRLINE CHAR(2) NOT NULL PRIMARY KEY,
  AIRLINE_FULL VARCHAR(24),
  BASIC_RATE DOUBLE PRECISION,
  DISTANCE_DISCOUNT DOUBLE PRECISION, …)
REPLICATE;
[Diagram: the replicated table hosted on each of three SQLF servers]
  • 12. Partitioned tables
Design pattern: partition fact tables in STAR schemas for load balancing (large, write-heavy).
CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  ORIG_AIRPORT CHAR(3),
  DEST_AIRPORT CHAR(3),
  DEPART_TIME TIME,
  FLIGHT_MILES INTEGER NOT NULL)
PARTITION BY COLUMN (FLIGHT_ID);
[Diagram: replicated tables plus one partition of the table on each of three SQLF servers]
  • 13. Partitioned but highly available
Design pattern: increase redundant copies for HA and to load-balance queries across replicas.
CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  ORIG_AIRPORT CHAR(3),
  DEST_AIRPORT CHAR(3),
  DEPART_TIME TIME,
  FLIGHT_MILES INTEGER NOT NULL)
PARTITION BY COLUMN (FLIGHT_ID)
REDUNDANCY 1;
[Diagram: each SQLF server holds primary partitions plus redundant partitions]
  • 14. Disk resident tables
The data dictionary is always persisted in each server.
CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  …)
PARTITION BY COLUMN (FLIGHT_ID)
PERSISTENT;
sqlf backup /export/fileServerDirectory/sqlfireBackupLocation
[Diagram: replicated, partitioned, colocated and redundant partitions spread across three SQLF servers]
  • 15. Partition by primary key
To partition using the primary key, use: PARTITION BY PRIMARY KEY
– A consistent hash on the key resolves to a logical bucket
– Buckets map to physical processes (nodes)
CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  ORIG_AIRPORT CHAR(3),
  DEST_AIRPORT CHAR(3),
  DEPART_TIME TIME,
  FLIGHT_MILES INTEGER NOT NULL,
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER))
PARTITION BY PRIMARY KEY;
  • 16. Partition by column(s)
To partition using a column or columns, use: PARTITION BY COLUMN (column-name [, column-name]*)
– The hash key uses all partition columns
CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  ORIG_AIRPORT CHAR(3),
  DEST_AIRPORT CHAR(3),
  DEPART_TIME TIME,
  FLIGHT_MILES INTEGER NOT NULL,
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER))
PARTITION BY COLUMN (FLIGHT_ID);
  • 17. Partition by list
To partition based on specific column values: PARTITION BY LIST (column-name) ( VALUES ( value [, value]* ) [, VALUES ( value [, value]* )]* )
CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  …,
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER))
PARTITION BY LIST (ORIG_AIRPORT)
  (VALUES ('PDX', 'LAX'),
   VALUES ('AMS', 'DUB'));
[Diagram: each list of values maps to a different node]
  • 18. Partition by range
To partition based on a range of values of a specific column: PARTITION BY RANGE (column-name) ( VALUES BETWEEN value AND value [, VALUES BETWEEN value AND value]* )
CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  …,
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER))
PARTITION BY RANGE (FLIGHT_MILES)
  (VALUES BETWEEN 0 AND 100,
   VALUES BETWEEN 100 AND 500,
   VALUES BETWEEN 500 AND 1000);
[Diagram: each range maps to a different node]
  • 19. Partition by expression
To partition on a derived value: PARTITION BY (expression)
CREATE TABLE FLIGHTS (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  …,
  PRIMARY KEY (FLIGHT_ID, SEGMENT_NUMBER))
PARTITION BY (HOUR(DEPART_TIME));
  • 20. Demo environment
SQLFire locator: sqlf locator start -client-bind-address=loc1 -client-port=1527
SQLFire server 1: sqlf server start -locators=loc1[10101] -client-bind-address=server1 -client-port=1528
SQLFire servers 2 and 3: started the same way, each with -locators=loc1[10101]
JMX agent: sqlf agent start -locators=loc1[10101]
[Diagram: a SQL client connects through the locator to SQLFire servers 1–3]
  • 21. Scaling with partitioned tables
  • 22. Hash partitioning for linear scalability
Key hashing provides single-hop access to the partition that owns the key. But what if access is not based on the key – say, joins are involved?
  • 23. Hash partitioning only goes so far
• Consider this query:
SELECT * FROM flights, flightAvailability WHERE <equijoin flights with flightAvailability> AND flight_id = 'AA1116';
• If both tables are hash partitioned, the join logic will need to execute on all nodes where FLIGHTAVAILABILITY data is stored
• Distributed joins are expensive and inhibit scaling
– Joins across distributed nodes could involve distributed locks and potentially a lot of intermediate data transfer across nodes
• Equijoin is supported only for colocated data in SQLFire 1.0
  • 24. Partition-aware DB design
The designer thinks about how data access maps to logical partitions. For scaling, try to:
1) Minimize excessive data distribution by keeping the most frequently accessed and joined data colocated on partitions
2) Colocate each transaction's working set on partitions so complex two-phase commit/Paxos commit is eliminated or minimized
Read Pat Helland's “Life beyond Distributed Transactions” and the Google Megastore paper.
  • 25. Partition-aware DB design – identify a partition key for each “entity group”
• “Entity group”: a set of entities across several related tables that all share a single identifier
– FLIGHT_ID is shared between the parent and child tables
– CustomerID is shared between customer, order and shipment tables
CREATE TABLE FLIGHTAVAILABILITY (
  FLIGHT_ID CHAR(6) NOT NULL,
  SEGMENT_NUMBER INTEGER NOT NULL,
  …)
PARTITION BY COLUMN (FLIGHT_ID)
COLOCATE WITH (FLIGHTS);
  • 26. Partition-aware DB design
Select * from Flights where flight_id = 'UA326'
Select * from Flights f, flightAvailability fa where <JOIN clause> and flight_id = 'UA326'
Select * from Flights f, flightAvailability fa where <JOIN clause> and flight_id IN ('UA326', 'AA400')
Select * from Flights f where orig_airport = 'SFO'
  • 27. Partition-aware DB design
• STAR schema design is the norm in OLTP design
• Fact tables (fast changing) are natural partitioning candidates
– Partition by FLIGHT_ID … availability and history rows colocated with FLIGHTS
• Dimension tables are natural replicated-table candidates
– Replicate AIRLINES, COUNTRIES, CITIES on all nodes
• Dealing with joins involving M–M relationships
– Can the “one” side of the M–M become a replicated table?
– If not, run the join logic in a parallel stored procedure to minimize distribution
– Else, split the query into multiple queries in the application (see the sketch below)
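A hedged sketch of that last option – splitting a join into separate queries in the application. The table and column names come from the demo schema; the per-key probing strategy is illustrative and not SQLFire-specific:

    // Fetch matching parent keys first, then probe the child table per key;
    // each probe prunes to the partition that owns that FLIGHT_ID, so no
    // cross-node distributed join is needed.
    import java.sql.*;
    import java.util.*;

    public class SplitJoin {
        static void availabilityFromSfo(Connection conn) throws SQLException {
            List<String> ids = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT FLIGHT_ID FROM FLIGHTS WHERE ORIG_AIRPORT = ?")) {
                ps.setString(1, "SFO");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) ids.add(rs.getString(1));
                }
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT * FROM FLIGHTAVAILABILITY WHERE FLIGHT_ID = ?")) {
                for (String id : ids) {
                    ps.setString(1, id);
                    try (ResultSet rs = ps.executeQuery()) {
                        // consume child rows for this flight here
                    }
                }
            }
        }
    }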
  • 28. APPLICATION DESIGN PATTERNS
  • 29. 1. “Write-through” distributed caching
– “Write-through”: participates in the container transaction
– Lazily load using a “RowLoader” for PK queries
– Trade-off: throttled by the legacy database
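For illustration, a hypothetical loader for this pattern. The RowLoader interface shape and the SYS.ATTACH_LOADER call below are assumptions about the SQLFire 1.0 callback API rather than verbatim from the docs; treat this as a sketch of the lazy-load pattern only:

    // Hedged sketch: on a PK miss, SQLFire asks the loader to fetch the row
    // from the legacy database. Interface and attach procedure are assumed.
    import java.sql.*;

    public class FlightsLoader implements com.vmware.sqlfire.callbacks.RowLoader {
        private Connection legacy;  // connection to the backing RDBMS

        public void init(String initStr) {          // assumed init hook
            // e.g. open 'legacy' from a JDBC URL passed in initStr
        }

        // Assumed signature: return the row for the given primary key
        // (e.g. a ResultSet positioned on one row), or null if absent.
        public Object getRow(String schema, String table, Object[] pk)
                throws SQLException {
            PreparedStatement ps = legacy.prepareStatement(
                "SELECT * FROM FLIGHTS WHERE FLIGHT_ID = ? AND SEGMENT_NUMBER = ?");
            ps.setObject(1, pk[0]);
            ps.setObject(2, pk[1]);
            return ps.executeQuery();
        }
        // Attaching it (assumed system procedure; verify against the docs):
        //   CALL SYS.ATTACH_LOADER('APP', 'FLIGHTS', 'com.example.FlightsLoader', '');
    }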
  • 30. 2. Distributed caching with async writes to the DB
– Queues reside in memory, redundantly, and can be persisted on multiple nodes
– Primary/secondary listeners
– Store-and-forward
  • 31. Demo: write-behind to MySQL using the DBSynchronizer (AsyncEventListener)
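A hedged sketch of what such a setup could look like over JDBC. The CREATE ASYNCEVENTLISTENER / ALTER TABLE statements follow the general shape of SQLFire's documented DDL, but the listener class name, init parameters and MySQL URL here are assumptions to verify against the product docs:

    // Configure write-behind to MySQL via the DBSynchronizer listener.
    import java.sql.*;

    public class WriteBehindSetup {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlfire://myHostName:1527/");
                 Statement s = conn.createStatement()) {
                // Assumed DDL shape: listener class plus driver,URL init params.
                s.execute(
                    "CREATE ASYNCEVENTLISTENER mysqlSync (" +
                    " LISTENERCLASS 'com.vmware.sqlfire.callbacks.DBSynchronizer'" +
                    " INITPARAMS 'com.mysql.jdbc.Driver,jdbc:mysql://db-host/airlinedb'" +
                    ") SERVER GROUPS (sg1)");
                // Route the table's DML through the queue (store-and-forward).
                s.execute("ALTER TABLE FLIGHTS SET ASYNCEVENTLISTENER (mysqlSync)");
                s.execute("CALL SYS.START_ASYNC_EVENT_LISTENER('MYSQLSYNC')");
            }
        }
    }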
  • 32. 3. As a scalable OLTP data store
– High throughput, low response time, linear scale
– Redundant copies, shared-nothing persistence, online backups
– Reduce maintenance cost and operational overhead
  • 33. 4. As an embedded, clustered Java database
– Just deploy a JAR or WAR into clustered app nodes
– Just like H2 or Derby, except data can be synced with a DB and is partitioned or replicated across the cluster
– Simply switch the URL from jdbc:sqlfire://myHostName:1527/ to jdbc:sqlfire:;mcast-port=33666;host-data=true
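A minimal sketch contrasting the two modes. Only the two JDBC URLs come from the slide; the query and table name are illustrative:

    // Thin-client vs. embedded-peer connections to SQLFire.
    import java.sql.*;

    public class EmbeddedVsClient {
        public static void main(String[] args) throws SQLException {
            // Thin client to a running cluster:
            // DriverManager.getConnection("jdbc:sqlfire://myHostName:1527/");

            // Embedded peer: this JVM joins the cluster and hosts data itself.
            try (Connection c = DriverManager.getConnection(
                     "jdbc:sqlfire:;mcast-port=33666;host-data=true");
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery("SELECT COUNT(*) FROM FLIGHTS")) {
                rs.next();
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }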
  • 34. 5. To process app behavior in parallel
– Map-reduce, but based on a simpler RPC model
  • 35. Scaling application logic with parallel “data-aware procedures”
  • 36. Procedures
Java stored procedures may be created according to the SQL standard:
CREATE PROCEDURE getOverBookedFlights ()
LANGUAGE JAVA
PARAMETER STYLE JAVA
READS SQL DATA
DYNAMIC RESULT SETS 1
EXTERNAL NAME 'examples.OverBookedStatus.getOverBookedStatus';
SQLFire also supports the JDBC type Types.JAVA_OBJECT; a parameter of type JAVA_OBJECT supports an arbitrary serializable Java object. In this case, the procedure executes on the server to which the client is connected (or locally for peer clients).
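For completeness, a sketch of the Java class the EXTERNAL NAME clause points at, following the Derby-style SQL/JRT convention that SQLFire inherits: a public static method whose trailing ResultSet[] parameter returns the dynamic result set. The query body and threshold are illustrative:

    package examples;

    import java.sql.*;

    public class OverBookedStatus {
        public static void getOverBookedStatus(ResultSet[] results)
                throws SQLException {
            // "jdbc:default:connection" is the standard nested connection
            // to the database that is executing the procedure.
            Connection conn = DriverManager.getConnection("jdbc:default:connection");
            PreparedStatement ps = conn.prepareStatement(
                "SELECT FLIGHT_ID, SEGMENT_NUMBER FROM FLIGHTAVAILABILITY " +
                "WHERE ECONOMY_SEATS_TAKEN > 200");  // illustrative threshold
            results[0] = ps.executeQuery();  // left open: returned to the caller
        }
    }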
  • 37. Data-aware procedures
Parallelize the procedure and prune to nodes with the required data. Extend the procedure call with the following syntax:
CALL [PROCEDURE] procedure_name ( [ expression [, expression]* ] )
  [ WITH RESULT PROCESSOR processor_name ]
  [ { ON TABLE table_name [ WHERE whereClause ] }
  | { ON { ALL | SERVER GROUPS ( server_group_name [, server_group_name]* ) } } ]
Hint the data the procedure depends on:
CALL getOverBookedFlights() ON TABLE FLIGHTAVAILABILITY WHERE FLIGHT_ID = 'AA1116';
If the table is partitioned by the columns in the WHERE clause, procedure execution is pruned to the nodes with the data (the node with 'AA1116' in this case).
[Diagram: the client call fans out to Fabric Server 1 and Fabric Server 2]
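Invoking the data-aware procedure from a JDBC client might look like the following sketch: the extended ON TABLE … WHERE clause is passed through prepareCall, and the routing hint prunes execution to the node(s) owning 'AA1116'. The connection URL is illustrative:

    import java.sql.*;

    public class CallDataAware {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlfire://myHostName:1527/");
                 CallableStatement cs = conn.prepareCall(
                     "CALL getOverBookedFlights() " +
                     "ON TABLE FLIGHTAVAILABILITY WHERE FLIGHT_ID = 'AA1116'")) {
                cs.execute();
                try (ResultSet rs = cs.getResultSet()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1));  // overbooked flight
                    }
                }
            }
        }
    }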
  • 38. Parallelize procedure, then aggregate (reduce)
Register a Java result processor (optional in some cases):
CALL [PROCEDURE] procedure_name ( [ expression [, expression]* ] )
  [ WITH RESULT PROCESSOR processor_name ]
  [ { ON TABLE table_name [ WHERE whereClause ] }
  | { ON { ALL | SERVER GROUPS ( server_group_name [, server_group_name]* ) } } ]
[Diagram: the client call fans out to Fabric Servers 1–3; the result processor aggregates]
  • 39. Demo: data-aware procedures
  • 40. 6. To make data visible across sites in real time
  • 41. Consistency model
  • 42. Consistency model without transactions
– Replication within the cluster is always eager and synchronous
– Row updates are always atomic; no need to use transactions
– FIFO consistency: writes performed by a single thread are seen by all other processes in the order in which they were issued
  • 43. Consistency model without transactions
– Consistency in partitioned tables
• A partitioned-table row is owned by one member at a point in time
• All updates are serialized to replicas through the owner
• “Total ordering” at a row level: atomic and isolated
– Membership changes and consistency – that needs another hour
– Pessimistic concurrency support using SELECT … FOR UPDATE
– Support for referential integrity
  • 44. Distributed transactions
• Full support for distributed transactions
• Supports READ_COMMITTED and REPEATABLE_READ
• Highly scalable, without any centralized coordinator or lock manager
• We make some important assumptions:
– Most OLTP transactions are small in duration and size
– W-W conflicts are very rare in practice
  • 45. Distributed transactions – how does it work?
• Each data node has a sub-coordinator to track transaction state
• Eagerly acquire local “write” locks on each replica
– An object is owned by a single primary at a point in time
– Fail fast if the lock cannot be obtained
• Atomic, and works with the cluster failure-detection system
• Isolated until commit for READ_COMMITTED
• Only local isolation is supported during commit
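From the application's side this is ordinary JDBC. A minimal sketch using one of the two supported isolation levels; the table and key values are illustrative, based on the demo schema:

    import java.sql.*;

    public class TxnExample {
        public static void main(String[] args) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:sqlfire://myHostName:1527/")) {
                conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
                conn.setAutoCommit(false);
                try (PreparedStatement ps = conn.prepareStatement(
                        "UPDATE FLIGHTAVAILABILITY SET ECONOMY_SEATS_TAKEN = " +
                        "ECONOMY_SEATS_TAKEN + 1 WHERE FLIGHT_ID = ? AND " +
                        "SEGMENT_NUMBER = ? AND FLIGHT_DATE = ?")) {
                    ps.setString(1, "AA1116");
                    ps.setInt(2, 1);
                    ps.setDate(3, Date.valueOf("2012-10-15"));
                    ps.executeUpdate();
                    conn.commit();   // write locks on the replicas released here
                } catch (SQLException e) {
                    conn.rollback(); // e.g. on a (rare) write-write conflict
                    throw e;
                }
            }
        }
    }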
  • 46. Parallel disk persistence
  • 47. Why is disk latency so high?
• Challenges
– Disk seek times are still > 2 ms
– OLTP transactions are small writes
• Flushing each write to disk results in a seek
• Best rates are in the hundreds per second
• RDBMSs and NoSQL stores try to avoid the problem
– Append to transaction logs; out-of-band writes to data files
– But reads can still cause seeks to disk
  • 48. Disk persistence in SQLFire
[Diagram: on each node, memory tables append records through a log compressor and OS buffers into append-only operation logs]
• Parallel log-structured storage
– Each partition writes in parallel
– Backups also write to disk, increasing reliability against hardware loss
• Don't seek to disk; don't flush all the way to disk
– Use the OS scheduler to time writes
– Do this on primary + secondary
– Realize very high throughput
  • 49. Performance benchmark
  • 50. How does it perform? Does it scale?
• Scale from 2 to 10 servers (one per host)
• Scale from 200 to 1,200 simulated clients (10 hosts)
• A single partitioned table: int PK, 40 fields (20 ints, 20 strings)
  • 51. How does it perform? Does it scale?
• CPU% remained low per server – about 30%, indicating many more clients could be handled
  • 52. Is latency low at scale?
• Latency decreases with server capacity
• 50–70% of operations take < 1 millisecond
• About 90% take less than 2 milliseconds
  • 53. Thank you. You can reach us at:
Jags Ramnarayan: jramnara@vmware.com
Guillermo Tantachuco: gtantachuco@vmware.com
http://communities.vmware.com/community/vmtn/appplatform/vfabric_sqlfire
Q&A