Quotient Technology Inc. is a digital commerce platform that offers digital coupons and media solutions, serving hundreds of CPGs, such as Clorox, Procter & Gamble, General Mills, and Kellogg's; retailers like Albertsons Companies, CVS, Dollar General, and Walgreens; and U.S. households. Quotient delivered more than 3.5 billion digital coupons in 2017, which works out to about 6,747 coupons each minute. MariaDB is a critical component in Quotient's stack, helping to power its modern offer management system by supporting the most complex offer types and personalized shopper experiences. In this session, the Quotient team discusses:
- Complex offer types, payload scenarios and how MariaDB helps Quotient scale
- MariaDB architecture, configuration and replication across multiple active datacenters
- Retailer data and security, and how MariaDB native data-at-rest encryption helps
Quotient Artificial Intelligence: rich data for precision targeting at scale across multiple touchpoints. Machine learning delivers near real-time recommendations, personalized to each audience and touchpoint (desktop, touch pad, mobile, in store) with speed and scale: 450M digital personalized recommendations per day.
We are here to talk about our journey with MariaDB and how we help 60-70% of U.S. households save money on grocery shopping. It is a pretty big scale, serving most of the grocery retailer chains in the US, with very low latency expectations for transactions originating from the POS.
I will be co-presenting along with my colleagues Radha Krishnan, Senior Engineering Architect, and Ashu Sahijpal, Principal Database Architect.
Let me first start with a question. How many people here use or always look for a coupon, or feel it would be good to have a coupon during a grocery purchase? This could be an FSI (free-standing insert, i.e. a paper coupon) or a digital coupon that you activate through a retailer's loyalty program app, like Albertsons' J4U, or the Target or Kroger apps.
Wow, everyone likes to save money; this is a good, healthy market.
Or at least this is definitely a group we should target to market our products. Mark my words, you will be surprised to save at least 5 to 10% on a basket size of $50. Who doesn't want $8 in savings on Claritin 60ct, or $3 off Tide laundry detergent? If it is personalized and reaches you when you need it, then why not?
To give an idea of how much money is at stake in couponing every year: there were ~641 billion dollars in sales, and in the same year ~293 billion coupons were distributed with a value of ~573 billion dollars; however, only 2.75 billion dollars' worth of coupons were actually redeemed. In a recent survey …
Coupons are a driving force in commerce that connects people to brands.
Our platform connects shoppers, brands, and retailers from coast to coast.
Our engineering team consists of software developers, data scientists, and quality assurance engineers dedicated to building and operating our suite of innovative and scalable solutions that enable our overall mission: helping our consumer packaged goods and retail partners increase sales.
Digital personalized recommendation of offers at every customer touchpoint (desktop, touch pad, mobile, point of sale) needs to happen in milliseconds. We handle 450 million recommendations per day and 14,000 peak offer activations per minute, with millisecond transaction speeds for POS integrations.
At a billing register, when a redemption transaction times out at the POS, that consumer loses money, and there is potential for the retailer to lose the customer due to a bad experience. Technology at scale becomes very important here.
I welcome RK and Ashu to talk about:
- How MariaDB and related technologies play an important role in achieving this scale
- Offer type modeling
- Database architecture, monitoring, and our learnings
Over to RK.
Like Arun mentioned, Quotient is the leading consumer packaged goods marketing technology ecosystem that delivers data-powered, personalized digital promotions and media across channels like mobile and web, including reporting and analytics. I am going to focus on one of my favorite use cases: our digital promotions.
We start a marketing promotion from grocery manufacturers like P&G and distribute these coupons across all retailers, like Safeway or Walgreens.
Why is this important? We built an anti-stacking service for these high-value offers that can limit redemption to once per device or phone number.
One of our several architectures is the Digital Promotions architecture:
A user opens the app, clips digital coupons, and checks out using a phone number or loyalty card. The retailer uses our couponing services to look up the user and the clipped coupons, and gives instant discounts by redeeming the coupons directly at the POS, without the user having to remember to bring an old paper coupon. Overall, this stack serves several million API calls per hour in real time.
This simplified microservices architecture shows our end-to-end flow of transactions in one of our private data centers, using a traditional Lambda architecture for both real-time and batch processing of all coupon clips and redemptions. In each data center, the APIs are routed through the LB -> Apache WS -> Gateway (authentication) -> REST/microservices.
All of our microservices are either Spring Boot or Spring MVC based, and we constantly experiment with the latest and greatest stacks. We use a distributed cache to look up coupon metadata and coupon publishing metadata, but MariaDB is our primary source of this metadata. We write the transactions in real time to Cassandra, and near real time, asynchronously, to MariaDB, which is strongly consistent, for ad hoc analytics.
Our microservices use JPA/Hibernate entity mapping libraries to read from and write to MariaDB, and also to run cascading updates across multiple tables atomically. MariaDB connections use an optimized connection pooling library to keep persistent connections to the databases. The transactions go through the messaging pipelines for batch processing and for downstream systems.
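The atomic multi-table update pattern above can be sketched in a few lines. This is a minimal, hypothetical illustration (the `coupon` and `coupon_budget` tables and their columns are invented, not Quotient's real schema), using Python's stdlib sqlite3 as a stand-in for MariaDB/InnoDB:

```python
# Sketch: decrement a coupon's budget and deactivate it when exhausted,
# inside one transaction so both tables stay consistent.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE coupon (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE coupon_budget (coupon_id INTEGER, remaining INTEGER);
    INSERT INTO coupon VALUES (1, 'ACTIVE');
    INSERT INTO coupon_budget VALUES (1, 100);
""")

def redeem(conn, coupon_id):
    # `with conn:` commits on success and rolls back on any exception,
    # so the two UPDATEs are applied atomically or not at all.
    with conn:
        conn.execute(
            "UPDATE coupon_budget SET remaining = remaining - 1 "
            "WHERE coupon_id = ?", (coupon_id,))
        conn.execute(
            "UPDATE coupon SET status = 'EXHAUSTED' "
            "WHERE id = ? AND (SELECT remaining FROM coupon_budget "
            "WHERE coupon_id = ?) <= 0", (coupon_id, coupon_id))

redeem(conn, 1)
remaining = conn.execute(
    "SELECT remaining FROM coupon_budget WHERE coupon_id = 1").fetchone()[0]
print(remaining)  # 99
```

In production this would run through JPA/Hibernate's transaction management rather than hand-written SQL, but the database-level guarantee is the same.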
About 5 years ago, we were early adopters of Cassandra and found that it didn't fit several use cases, so we really needed an RDBMS
…
with support for multi-data-center replication and clustering, and MariaDB was the best choice.
I will walk through some of the use cases where MariaDB shined
We have a wide variety of coupons, but they are limited to under a million active coupons, so we can represent them in a normalized way to support various use cases and also be ACID compliant, because the coupon metadata includes budgets and transaction limits.
For example, when you buy a bag of chips, you get an instant $2 off, or a percentage off, on the specific product purchased, using an Amount Off coupon.
We also have automatic accumulation of eBox Tops offers that you see on cereals; this way, you don't have to cut the box top out and send it to your school.
Besides these coupons, we also have programs like this one, where you buy or spend $20 on specific Colgate products across several trips in a month and get rewarded.
Each of these coupons has a strict lifecycle governing when it should start, end, and be clippable/redeemable at the POS. Very frequently, we want to shut a coupon off with strong consistency once it reaches its budget limit.
Like I said, this normalized coupon data model is spread across 50+ tables representing coupon metadata; I am only showing some of the important ones here. But this modeling allows us to query several thousand records by joining several tables at runtime, filtering by retailer, zipcode, and category, and also paginating, using MariaDB, in under a hundred milliseconds.
The offer itself has a lifecycle. An offer belongs to a promotion: multiple types of offers can be grouped as a promotion, which is published to a partner or retailer like Safeway.
And on the right side, we have the different conditions required to be eligible for different types of reward, the reward being the actual discount savings you will get.
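The runtime join-filter-paginate query described above can be sketched as follows. This is an illustrative two-table miniature of the real 50+ table model (the `offer` and `offer_publication` tables are invented), again using sqlite3 as a stand-in for MariaDB:

```python
# Sketch: filter offers by retailer, zipcode, and category via a join,
# then paginate with LIMIT/OFFSET -- the pattern an RDBMS handles well.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE offer (id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE offer_publication (offer_id INTEGER, retailer TEXT, zipcode TEXT);
""")
conn.executemany("INSERT INTO offer VALUES (?, ?)",
                 [(i, "cereal" if i % 2 else "laundry") for i in range(1, 101)])
conn.executemany("INSERT INTO offer_publication VALUES (?, ?, ?)",
                 [(i, "safeway", "94085") for i in range(1, 101)])

page = conn.execute("""
    SELECT o.id FROM offer o
    JOIN offer_publication p ON p.offer_id = o.id
    WHERE p.retailer = ? AND p.zipcode = ? AND o.category = ?
    ORDER BY o.id
    LIMIT 10 OFFSET 10  -- page 2, 10 offers per page
""", ("safeway", "94085", "cereal")).fetchall()
print([r[0] for r in page])  # [21, 23, 25, 27, 29, 31, 33, 35, 37, 39]
```

With indexes on the join and filter columns, MariaDB serves this shape of query over thousands of rows in the sub-100 ms range mentioned above.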
Our APIs originally supported extremely long-lived persistent tokens over HTTPS.
With OAuth 2.0, our authorization server needed to issue short-lived access tokens (and long-lived refresh tokens), and MariaDB worked really well here with its cross-DC replication (typically under a hundred milliseconds). Although the tokens are cached for short periods in Redis, they are still served from MariaDB whenever there is a traffic switch or the cache expires, allowing the APIs to continue using those short-lived access tokens. The expired tokens are deleted in batches, which is again not well supported in NoSQL databases due to tombstones.
NOTES: we have a 2-hour TTL on short-lived tokens.
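The batched expired-token cleanup can be sketched like this. MariaDB supports `DELETE ... LIMIT n` directly; the sqlite3 stand-in below uses a rowid subquery to get the same effect (the `access_token` table is an invented illustration):

```python
# Sketch: delete expired tokens in bounded batches so no single
# statement locks the table for long or blows up the binlog.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access_token (token TEXT PRIMARY KEY, expires_at INTEGER)")
now = 1_000_000
conn.executemany("INSERT INTO access_token VALUES (?, ?)",
                 [(f"t{i}", now - 1 if i < 2500 else now + 7200)
                  for i in range(3000)])

BATCH = 1000
deleted = 0
while True:
    cur = conn.execute(
        "DELETE FROM access_token WHERE rowid IN "
        "(SELECT rowid FROM access_token WHERE expires_at < ? LIMIT ?)",
        (now, BATCH))
    conn.commit()
    deleted += cur.rowcount
    if cur.rowcount < BATCH:  # fewer than a full batch -> nothing left
        break
print(deleted)  # 2500
```

In a log-structured NoSQL store each of these deletes would leave a tombstone; in InnoDB the rows are simply gone after the purge, which is why this workload sits in MariaDB.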
Sometimes we are required to target specific high-value coupons to specific users, so that nobody else can use them; this requires running massive bulk inserts and updates to bring them live or shut them off. The user typically sees these high-value coupons at the top after logging in.
All of these millions of records can be updated in one statement, or even deleted, in MariaDB very efficiently. Email campaigns are a similar bulk use case.
One of the business use cases is to execute complex business rules based on dates or counts, which means we need to keep track of counts for each coupon periodically, update those counts atomically, and run rules that do some sort of rate limiting. We have a batch job that runs frequently, aggregating and updating counts per hour in a table. MariaDB is really fast at aggregate queries and at updating tables atomically; NoSQL databases are not designed for analytics and aggregations.
Like I said, we use MariaDB for periodic hourly counts for reconciliation and for running some other business rules. But most of the analytics happens in our BI/data warehouse system.
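The hourly aggregation job described above can be sketched as a single aggregate upsert. In MariaDB this would be `INSERT ... ON DUPLICATE KEY UPDATE`; the sqlite3 stand-in uses `INSERT OR REPLACE`. Table names (`redemption`, `hourly_count`) and the limit value are invented for illustration:

```python
# Sketch: roll raw redemption transactions up into per-coupon hourly
# counts, then apply a rate-limit rule over the aggregated table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE redemption (coupon_id INTEGER, ts TEXT);
    CREATE TABLE hourly_count (coupon_id INTEGER, hour TEXT, cnt INTEGER,
                               PRIMARY KEY (coupon_id, hour));
""")
conn.executemany("INSERT INTO redemption VALUES (?, ?)", [
    (1, "2018-02-26 10:05:00"), (1, "2018-02-26 10:40:00"),
    (1, "2018-02-26 11:10:00"), (2, "2018-02-26 10:20:00")])

# One statement aggregates and upserts the counts atomically.
conn.execute("""
    INSERT OR REPLACE INTO hourly_count (coupon_id, hour, cnt)
    SELECT coupon_id, strftime('%Y-%m-%d %H', ts), COUNT(*)
    FROM redemption
    GROUP BY coupon_id, strftime('%Y-%m-%d %H', ts)
""")
conn.commit()

HOURLY_LIMIT = 2  # hypothetical rule threshold
over = conn.execute(
    "SELECT coupon_id FROM hourly_count WHERE cnt >= ?",
    (HOURLY_LIMIT,)).fetchall()
print(over)  # [(1,)] -> coupon 1 hit the hourly limit
```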
We only store a year's worth of transactions for quick analytics.
In Cassandra, you design based on the query usage patterns, whereas in MariaDB you can index and model to query with any pattern. Without further ado, I want to introduce Ashu, our DB architect, who will cover how they keep MariaDB up and running!
In this part of the presentation I'll go over how we use MariaDB. We'll talk about some of the issues we have run into and how we have tried to address them.
I'll continue to focus on Digital Promotions, which Krishna was talking about. This is one of the several clusters we use.
Our Digital Promotions cluster is shared by many partners and is configured as multi-master across our east and west data centers, with a 10G VPN tunnel between the two.
We have a failsafe mechanism configured on the LB: probes run across data centers against the master databases, and if a master database is down, all the traffic is moved to the other data center. We also have the ability to split and move partners between DCs; our operations team can quickly distribute 0-100% of traffic between data centers, and we are able to handle peak traffic in a single DC. This setup has given us the ability to quickly move traffic during maintenance, upgrades, or outages.
There are two masters where the application writes. A third master is used for data ingestion; we don't have much control over ingestions, as these are automated and initiated by some of our partners. Our application is designed to handle writes to both masters at the same time.
Both DCs have a nearly identical setup. The application reads round-robin from db03, db04, and db05, using a read-only VIP. db10 in the east DC is used by users for ad hoc queries and ETL jobs, and also serves as a standby server for ingestion jobs. We also use GemFire to cache data; those queries run on the read-only VIP.
Backups run on the standby masters. Each backup is taken locally to a NetApp storage mount and then copied to cloud storage managed by a third-party vendor. Binlogs are also copied locally and to cloud storage every 30 minutes.
For some applications running in the cloud, databases span 4 DCs: writes happen on the east and west on-prem databases, and these changes are propagated to read-only databases in the cloud.
We have a few Galera clusters across DCs. The application uses nodes 2 and 3; node 1 is reserved for replication, ETL, batch jobs, and backups.
Quotient Technology has been around for 20 years. We have used different database technologies like SQL Server, MongoDB, Redis, and Cassandra. We are trying to simplify this by using MariaDB for most new development, and at the same time we are working on migrating some legacy applications to MariaDB.
We use GTID on most of the clusters.
We see ~4,000 writes/updates and ~11,000 reads per second at peak. These are client requests and don't include other routine tasks like purging, data ingestion, and ETL. We partition large tables, since it is easier to drop table partitions, and we try to keep production databases as small as possible: maintenance and backup/restore are much easier on smaller databases.
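Dropping a partition is a metadata operation, so purging a month of data this way avoids a multi-million-row DELETE. A small helper like the one below could generate the MariaDB statement for a monthly retention job; the table name, partition naming scheme, and helper itself are hypothetical, not Quotient's actual tooling:

```python
# Sketch: generate the MariaDB DDL to purge one month of data by
# dropping its RANGE partition (assumes partitions named pYYYY_MM).
from datetime import date

def monthly_partition_name(d: date) -> str:
    """Partition name for the month containing d, e.g. p2017_02."""
    return f"p{d:%Y_%m}"

def drop_partition_sql(table: str, month: date) -> str:
    """DDL that removes a whole month of rows in O(1) instead of a
    row-by-row DELETE."""
    return f"ALTER TABLE {table} DROP PARTITION {monthly_partition_name(month)};"

print(drop_partition_sql("redemption", date(2017, 2, 1)))
# ALTER TABLE redemption DROP PARTITION p2017_02;
```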
We log slow queries that run for more than 100 ms. There is an automated process in place for sending the weekly slow queries to the development and database teams.
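A 100 ms slow-query threshold corresponds to a my.cnf fragment along these lines (the log file path is illustrative):

```ini
# my.cnf fragment: log any query running longer than 100 ms
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 0.1
```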
MonYOG, now known as SQL Diagnostic Manager, is configured for performance monitoring and alerting. It could be a single point of failure if the MonYOG server is down or unavailable, so to mitigate this we also use Nagios for system and basic database monitoring. We have another performance monitoring tool that we use in conjunction with MonYOG; I'll talk about it on the next slide.
As we all know, backups are a must, but it is even more important to validate them. We have a VM in each environment that automatically restores backups and runs scripts to validate the data. Delayed replication was introduced in 10.2.3, and we are configuring one slave database in each environment to run 12 hours behind.
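A 12-hour delayed slave is configured with `MASTER_DELAY`, available since MariaDB 10.2.3, roughly as follows:

```sql
-- On the delayed slave: apply replicated events 12 hours late,
-- giving a window to recover from accidental deletes upstream.
STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY = 43200;  -- 12 hours, in seconds
START SLAVE;
```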
We use Data At Rest Encryption for SOC2 compliance.
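As one illustration of what enabling MariaDB's native data-at-rest encryption looks like, here is a my.cnf fragment using the file_key_management plugin; the key file path is hypothetical, and a production setup (including Quotient's) may use different key management:

```ini
# my.cnf fragment: MariaDB data-at-rest encryption (illustrative)
[mysqld]
plugin_load_add              = file_key_management
file_key_management_filename = /etc/mysql/encryption/keyfile.enc
innodb_encrypt_tables        = ON
innodb_encrypt_log           = ON
innodb_encryption_threads    = 4
```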
In addition to MonYOG, we use an in-house Grafana/Graphite setup for database and system monitoring. We built this because some of the features we wanted were not available in other monitoring tools. We plot multiple nodes on the same graph, which helps us identify a poorly performing node.
Sometimes it is not possible to capture all the expensive queries in the slow query log or in SHOW PROCESSLIST. These graphs help identify the cause very quickly.
I would like to mention that we have enabled the userstat plugin in my.cnf. The value can also be turned on and off dynamically. This helps in estimating table, user, and client usage.
Here are some of the system stats we capture every minute and compare across nodes; these come from iostat, sysstat, and vmstat. In addition, we also gather many database stats. As you can see, there is a peak lasting a few minutes; we'll figure out why this happened using the graphs. This is a real scenario which happened a couple of weeks ago.
Let us look at these table statistics graphs. We don't use these complicated table names; they were changed for the demo. The graphs indicate that a deletion job ran for an hour and deleted approximately 4 million rows. Below are the queries we run to capture these graphs. So far we are tracking the top 10 active tables; the number of active tables differs in each environment. In the past we identified some caching-server queries running on the master databases, and we moved those to the slave databases.
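The exact production queries are on the slide rather than in these notes, but with `userstat = 1` enabled, a top-active-tables query of this general shape pulls the same information from the statistics tables:

```sql
-- Illustrative, not the exact production query: top 10 most-written
-- tables since statistics were last flushed (requires userstat = 1)
SELECT TABLE_SCHEMA, TABLE_NAME, ROWS_READ, ROWS_CHANGED
FROM information_schema.TABLE_STATISTICS
ORDER BY ROWS_CHANGED DESC
LIMIT 10;
```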
We also monitor some user- and client-related stats, like connections, CPU time, and rows/bytes sent and received, as well as what kinds of commands (SELECT, UPDATE, DELETE) are coming from each client. This is helpful for knowing who is using the most database resources in a multi-tenant database.
We are running some MariaDB instances on large servers and have noticed swapping issues. This is due to non-uniform use of memory across NUMA nodes by the mysqld process. Versions 10.2.3 onwards have NUMA-enabled builds, with innodb_numa_interleave=1.
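On NUMA-enabled builds, the corresponding my.cnf fragment is simply:

```ini
# my.cnf fragment: interleave the InnoDB buffer pool allocation
# across NUMA nodes to avoid one node swapping while others are idle
[mysqld]
innodb_numa_interleave = 1
```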
For SOC2 compliance, we encrypt PII data. We did extensive load testing on an encrypted database; for our workload we saw ~8% performance degradation, and we tried to optimize queries, the database, and the server to gain back every bit of performance.
Our data volume is growing, and lately we have been noticing issues with bulk loads on the Galera cluster.
We follow an agile model, with production pushes happening twice a week. To keep up with all these changes we have automated as much as we can. Capturing the processlist every minute, turning on userstats, and monitoring slow queries have been a great help.
To ensure optimal performance, we have configured the CPUs to use the performance governor, which locks the frequency at maximum. This governor never switches frequencies, which means there are no power savings, but the servers always run at maximum throughput.
To get optimal write performance we use battery-backed RAID, and we monitor the health of the batteries through Nagios.
>> Without the battery, RAID arrays couldn't do write caching without risking data loss during a power failure.
Currently we ETL millions of rows to Hadoop and Impala for analytics; we are evaluating whether we can benefit from MariaDB ColumnStore.
We manage failsafe/spillover through the firewall; MariaDB MaxScale seems promising for its load balancing and high availability functionality.
We have been using Kafka in a few applications; we could leverage Kafka streaming with the MaxScale CDC (Change Data Capture) for our analytics.
We are running a few smaller MariaDB clusters in the cloud, and are doing load testing and benchmarking for large clusters.
That's all from us. Thank you for your time!