Curriculum Associates Strata NYC 2017

Overview of Curriculum Associates' use of MemSQL


  1. 1. Real-time Application Architecture. David Mellor, VP & Chief Architect, Curriculum Associates
  2. 2. Building a Real-Time Feedback Loop for Education. David Mellor, VP & Chief Architect, Curriculum Associates. (Title adjusted to match the abstract submission.)
  3. 3. Our Mission • Curriculum Associates has a mission to make classrooms better places for teachers and students. • Our founding value drives us to continually innovate and produce exciting new products that give every student and teacher the chance to succeed: –Students –Teachers –Administrators • To meet our latest expectations, and to provide the best teacher/student feedback available, we are enabling our educators with real-time data.
  4. 4. In This Talk • The Architectural Journey • Understanding Sharding • Using Multiple Storage Engines • Adding Kafka Message Queues • Integrating the Data Lake
  5. 5. The Architectural Journey
  6. 6. The Architecture End State [Diagram: iReady lesson and event data flow through the HTC Dispatch event system and the Debezium connector (DB to Kafka) into Confluent Kafka (brokers plus ZooKeeper); the S3 connector (Kafka to S3) feeds a Data Lake with a Raw Store and a Read Optimized Store, which produces nightly load files for the paired MemSQL Reporting DBs.]
  7. 7. Our Architectural Journey • Where did we start, and what fundamental problem did we need to solve to get real-time data to our users? [Diagram: iReady lesson and event data flow through scheduled jobs and ETL into a Data Warehouse, then through a second ETL into a Reporting Data Mart.]
  8. 8. Start with the Largest Aggregate Report Our largest aggregate report logically consists of: –6,000,000 leaf values filtered to 250,000 –600,000,000 leaf values filtered to 10,000,000, used as the intermediate dataset –Rolled up to produce 300 aggregate totals –Response target: 1 second (a sketch of the query shape follows) [Diagram: 6,000,000+ students and 600,000,000+ facts; a district report rolls 10,000,000 facts up into 300 schools.]
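     As a rough sketch of the shape of that query: the students and facts table names and the district_id/school_id columns below are invented for illustration, not taken from the deck.

         SELECT s.school_id,
                SUM(f.fact1) AS total_fact1
         FROM   students s
         JOIN   facts f ON f.sid = s.sid   -- 600,000,000+ facts narrowed to ~10,000,000 by the district filter
         WHERE  s.district_id = 42         -- 6,000,000+ students narrowed to ~250,000
         GROUP BY s.school_id;             -- rolled up to ~300 school-level totals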
  9. 9. In-Memory Databases: MemSQL • SQL Compatible – our developers and our basic paradigm are SQL-based • Fast Calculations – we need to run large computations across large datasets • Fast Updates – we need to make real-time updates • Fast Loads – we need to reload our reporting database nightly • Massively Scalable – we need to support large data volumes and large numbers of concurrent users • Cost Effective – we need a practical solution based on cost. What MemSQL offers: • Columnar and Row storage models provide very fast aggregations across large amounts of data • Very fast load times allow us to update our reporting DB nightly • Very fast update times for Row storage tables • Highly scalable, based on its MPP architecture • A unique ability to query across Columnar and Row tables in a single query
  10. 10. Our MemSQL Journey Begins • Convert our existing database design to be optimal in MemSQL • Analyze our usage patterns to determine the best Sharding key • Create a prototype and run typical queries to determine the optimal database structure across the spectrum of projected usage –Use the same Sharding key in all tables –Push predicates down to as many tables as we can
  11. 11. Understanding Sharding
  12. 12. Why is the selection of a Sharding key so important? • Create the database with 9 partitions • Create the tables in the database using a sharding key that is advantageous to query execution • The goal is to get the execution of a given query distributed as evenly as possible over the partitions (example DDL follows) [Diagram: partitions PS1 through PS9 spread across NODE1, NODE2, and NODE3; dimension and fact rows with the same sid land in the same partition.]
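     A minimal sketch of that setup in MemSQL DDL, with invented database and table names (CREATE DATABASE ... PARTITIONS and the SHARD KEY clause are MemSQL's own syntax):

         CREATE DATABASE reportdb PARTITIONS 9;
         USE reportdb;

         -- both tables share the same shard key, so joins on sid stay partition-local
         CREATE TABLE dim_student (
           sid   VARCHAR(36) NOT NULL,
           descr VARCHAR(255),
           attr1 INT,
           SHARD KEY (sid)
         );

         CREATE TABLE fact_table (
           sid   VARCHAR(36) NOT NULL,
           fact1 INT,
           fact2 INT,
           SHARD KEY (sid)
         );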
  13. 13. How does the Sharding Key affect the “join”? Select a.sid, b.factid From table1 a, table2 b Where a.sid in (10 …) And b.sid in (10 …) And a.sid = b.sid • The basis of the join is the sid column. When the sharding key for both tables is chosen based on the sid column, the join can be done independently within each partition and the results merged. • This is the ideal situation for getting the nodes working in parallel, which maximizes query performance.
  14. 14. How does the Sharding Key affect the “join”? (continued) • For the same query: when the sharding key is not based on the sid column for both tables, the join cannot be done independently within each partition, causing what is called a broadcast. • This is not an ideal situation for getting the nodes working in parallel, and we have seen query performance degradation in these cases (a sketch using EXPLAIN follows).
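     One way to check for this condition (a workflow assumption, not something shown in the deck) is MemSQL's EXPLAIN, which prints the distributed query plan; a broadcast or repartition operator in the plan means the join is not partition-local. The tables reuse the hypothetical schema sketched earlier:

         -- shard keys match the join column, so the plan should show only local joins
         EXPLAIN
         SELECT a.sid, b.fact1
         FROM dim_student a
         JOIN fact_table b ON a.sid = b.sid
         WHERE a.sid IN ('10', '11');

         -- had fact_table been sharded on a different column, the same EXPLAIN
         -- would show rows being broadcast between partitions to satisfy the join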
  15. 15. Using Multiple Storage Engines
  16. 16. Columnar and Row Storage Models • Row storage was the most performant for queries, loads, and updates – it is also the most expensive solution • Columnar storage was performant for some queries and for loads, but degraded with updates – cost effective, but not performant enough on too many of the target queries • To maximize our use of MemSQL, we combined Row storage and Columnar storage to create a logical table –Volatile (changeable) data is kept in Row storage –Non-Volatile (immutable) data is kept in Columnar storage –Requests for data are made using “Union All” queries
  17. 17. Columnar and Row [Diagram: one logical table (SID, FACT1, FACT2, Active) composed of a Row storage portion holding the volatile rows and a Columnar storage portion holding the immutable rows.] Select sid, fact1, fact2 From fact_row Where sid in (1 … 10) Union All Select sid, fact1, fact2 From fact_columnar Where sid in (1 … 10)
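     A sketch of how the two storage engines and a view over them might be declared, using invented table names (the CLUSTERED COLUMNSTORE key is MemSQL's columnstore syntax of that era):

         -- rowstore (MemSQL's default engine) holds the volatile rows
         CREATE TABLE fact_row (
           sid    VARCHAR(36) NOT NULL,
           fact1  INT,
           fact2  INT,
           active TINYINT,
           SHARD KEY (sid)
         );

         -- columnstore holds the immutable rows; cheap to scan and aggregate
         CREATE TABLE fact_columnar (
           sid    VARCHAR(36) NOT NULL,
           fact1  INT,
           fact2  INT,
           active TINYINT,
           KEY (sid) USING CLUSTERED COLUMNSTORE,
           SHARD KEY (sid)
         );

         -- a view hides the Union All, so readers see one logical table
         CREATE VIEW fact_logical AS
           SELECT sid, fact1, fact2, active FROM fact_row
           UNION ALL
           SELECT sid, fact1, fact2, active FROM fact_columnar;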
  18. 18. Adding Kafka Message Queues
  19. 19. Dispatching the Human Time Events • We had a database engine that could perform our queries • We solved our cost and scaling needs • We proved we could load and update the database on the desired schedule • How are we going to get the real-time data to the Reporting DB? [Diagram: iReady lesson and event data flow through the HTC Dispatch event system into Kafka (brokers plus ZooKeeper) and on to the MemSQL Reporting DB.]
  20. 20. Dispatching the Human Time Events • Use MemSQL Pipelines to ingest data into MemSQL from Kafka • Pipelines are declared MemSQL objects • Managed and controlled by MemSQL • No significant transforms [Diagram: a stream of events with JSON payloads flows from Kafka through a MemSQL Pipeline into the MemSQL Reporting DB.]
  21. 21. Kafka and MemSQL Pipelines • Tables are augmented with a column that holds the event in JSON form • All other columns are derived from it:

         CREATE TABLE realtime.fact_table (
           event JSON NOT NULL,
           SID   AS event::data::$SID     PERSISTED VARCHAR(36),
           FACT1 AS event::data::rawFact1 PERSISTED INT(11),
           FACT2 AS event::data::rawFact2 PERSISTED INT(11),
           KEY (SID)
         );

         CREATE PIPELINE fact_pipe AS
           LOAD DATA KAFKA '0.0.0.0:0000/fact-event-stream'
           INTO TABLE realtime.fact_table
           (event);
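     The pipeline is then exercised and started like any other MemSQL object; the deck stops at the CREATE statements, so the following workflow is a sketch:

         TEST PIPELINE fact_pipe LIMIT 1;   -- pull a sample batch from Kafka without writing it
         START PIPELINE fact_pipe;          -- begin continuous background ingest
         -- progress can be monitored through the information_schema pipeline views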
  22. 22. Adding the Nightly Rebuild Process • Get the transactional data from the database • Employ database replication to dedicated slaves • Introduce the Confluent platform to unify data movement through Kafka • Deploy the Debezium Confluent connector to move the replication log data into Kafka [Diagram: the iReady database replicates into Confluent Kafka via the Debezium connector (DB to Kafka), alongside the existing HTC Dispatch event flow into the MemSQL Reporting DB.]
  23. 23. Integrating the Data Lake
  24. 24. Create and Update the Data Lake • Build a Data Lake in S3 • Deploy the Confluent S3 connector to move the transaction data from Kafka to the Data Lake • Split the Data Lake into two distinct forms – Raw and Read Optimized • Deploy Spark to move the data from the Raw form to the Read Optimized form [Diagram: the S3 connector (Kafka to S3) feeds the Data Lake's Raw Store, which Spark transforms into the Read Optimized Store.]
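     The deck does not show the Spark job itself. As one plausible shape for it, the Spark SQL sketch below reads the raw JSON events and rewrites them as partitioned Parquet; the bucket paths, column names, and ts timestamp field are invented for illustration:

         -- expose the raw JSON landed by the S3 connector as a queryable view
         CREATE TEMPORARY VIEW raw_events
         USING json
         OPTIONS (path 's3://datalake/raw/fact-event-stream/');

         -- rewrite it in a scan-friendly, read-optimized layout
         CREATE TABLE read_optimized_events
         USING parquet
         PARTITIONED BY (event_date)
         LOCATION 's3://datalake/read-optimized/fact-event-stream/'
         AS SELECT data.SID      AS sid,
                   data.rawFact1 AS fact1,
                   data.rawFact2 AS fact2,
                   to_date(ts)   AS event_date
            FROM raw_events;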
  25. 25. Move the Data from the Data Lake to MemSQL • Deploy Spark to transform the data from the Read Optimized form to a Reporting Optimized form • Save the output to a managed S3 location • Deploy MemSQL S3 Pipelines to automatically ingest the nightly load files from the specified location • Deploy the MemSQL Pipeline to Kafka • Activate the MemSQL Pipeline when the reload is complete [Diagram: nightly load files flow from the Data Lake's Read Optimized Store into a second MemSQL Reporting DB.]
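     An S3 pipeline is declared much like the Kafka one; a minimal sketch with an invented bucket path and elided credentials:

         CREATE PIPELINE nightly_load AS
           LOAD DATA S3 'datalake/nightly-load-files/'
           CONFIG '{"region": "us-east-1"}'
           CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}'
           INTO TABLE reporting.fact_table
           FIELDS TERMINATED BY ',';

         START PIPELINE nightly_load;   -- activated once the nightly rebuild files are in place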
  26. 26. Swap the Light/Dark MemSQL DB • Open up the Dark DB to accept connections • Trigger an iReady application event to drain the current connection pool and replace the connections with new connections to the new database • Close the current Light DB [Diagram: the application cuts over between the paired MemSQL Reporting DBs once the nightly load has completed.]
  27. 27. The Architecture End State [Diagram: repeats slide 6, the full flow from iReady through the HTC Dispatch event system and Confluent Kafka (Debezium DB-to-Kafka and S3 Kafka-to-S3 connectors) into the Data Lake's Raw and Read Optimized Stores, with nightly load files feeding the paired MemSQL Reporting DBs.]
  28. 28. Key Takeaways • Ensure the system you are considering is up to the challenge of your most sophisticated queries • With distributed systems, spend the time to pick the right sharding strategy • Make use of multiple storage engines where available • Design workflows with message queues for flexibility and updateability • Incorporate data lakes for long-term retention and context
