TerraAlto technical talk on a DevOps approach to Data Warehousing with Redshift and Lambda. Talk given by Senior Data & Database Engineer Andras Gombosi at the AWS Enterprise Briefing, Dublin, Ireland, on Wednesday, 28 November 2018.
Hello, my name is Andras Gombosi. I am a senior Data and Database Engineer at TerraAlto.
We are a well-established, Dublin-based technical consultancy focusing solely on AWS. We are an AWS Advanced Consulting Partner, and also an AWS Managed Services Provider, one of an elite group of 126 companies worldwide holding this competency.
We serve clients of all sizes, from start-ups to truly global enterprises, and we have countless migrations under our belt in Europe, Asia and also in AWS China. We are also working on various projects in the space of Big Data, IoT, Data Lakes and blockchain-based track-and-trace solutions.
One of our core operating principles is automation. The topic I have brought today is automation in an area where it is not yet as widespread: Data Warehousing and BI development.
The ongoing rumour is that Redshift was named to mark a move away from something Red… I worked with those Red technologies for nearly a decade, but I have since made the shift as well.
Redshift in physics happens when light undergoes an increase in wavelength. This phenomenon is directly related to the expansion of space, the expansion of the universe.
Redshift is an exceptionally good service for corporate data warehousing, both as a standalone DWH and as a SQL-compatible extension of a corporate data lake. As usual, AWS does most of the heavy lifting, but data security and cluster performance require great care and attention from the customer side as well.
Redshift's capability to grow makes it possible for organisations to have a single, true enterprise Data Warehouse, typically queried, developed and modified by multiple, often geographically distributed teams and processes; in some cases hundreds of people and systems have some sort of access to it.
Developers and Data Engineers modify data and change structures in Data Marts
Data Analysts query data directly
DBAs change data security (grants and revokes) and do housekeeping (VACUUM, ANALYZE)
ETL processes (Glue, EMR, Matillion, Informatica) constantly insert and update
Front-end BI tools (QuickSight, Tableau, MicroStrategy, Spotfire) query data through data marts
Control? The challenges are not new, just a bit amplified: by the scale, and by the open-source origins of Redshift, as open-source solutions are typically surrounded by a tooling ecosystem which Redshift does not have out of the box right now.
SLAs on data availability and on the uptime of data marts or other data sources for the upstream consumers. That means ETL/ELT jobs run in a timely and performant manner, and BI teams and other upstream consumer tools can connect and query without any disruption.
Security. In this case security of the Data itself. Who has access to what?
Audit and Compliance. Who changed what exactly and when?
In a complicated environment it is vital to have formal, automated processes without human intervention; otherwise, due to the sheer scale, the proper management of these challenges becomes very time-consuming and sometimes near impossible.
One possible solution is a “DevOps” style governance framework.
Yes, bringing database changes under the DevOps umbrella is an increasingly popular topic. There are many tools and many ways to build a pipeline: some of them are pricey, some complicated, some only work with specific DB engines, and some are all three of these.
Nevertheless, the principles are the same for a Redshift CD pipeline too.
A code repository for code version control and audit is the entry point, triggering an event-driven, automatic, intelligent Continuous Deployment capability. Ideally this is accompanied by an in-cluster, database- and schema-based user and privilege management framework which controls access via user groups, dedicated service users and default privileges.
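To make that entry point concrete, here is a minimal sketch of wiring a CodeCommit repository to a task router Lambda with Boto, so that pushes and merges to the watched branches fire the pipeline. The repository name, branch names and function ARN are illustrative assumptions, not details from the talk.

```python
import boto3

codecommit = boto3.client("codecommit")

# Hypothetical names: replace the repository, branches and Lambda ARN with your own.
codecommit.put_repository_triggers(
    repositoryName="dwh-datamart-sales",
    triggers=[
        {
            "name": "invoke-task-router",
            "destinationArn": "arn:aws:lambda:eu-west-1:123456789012:function:redshift-task-router",
            "branches": ["dev", "master"],   # an empty list would mean every branch
            "events": ["updateReference"],   # fire on pushes / merges to these branches
        }
    ],
)
# The router Lambda also needs a resource policy allowing codecommit.amazonaws.com to invoke it.
```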
The solution I brought today is a BASIC, practically free, cloud- and AWS-native way to get going. It uses nothing but AWS and Python.
CodeCommit is the starting point. Multiple communities use separate repositories, and different branches are set up. Some branches are protected and cannot be pushed into directly, only via pull requests and merging.
Code being pushed or merged to the appropriate branch triggers a task router.
Task router: this can be CodePipeline with a Lambda as a custom action for the Build stage to execute anything on a database. For most organisations a plain Lambda might be more suitable; your mileage may vary. We are using Lambda for this step too.
The task router understands information about the commit and evaluates requests: the commit message, for example, can carry the order in which you want to run your SQL files, or routing information, i.e. you can flag your commit if it contains a big task so that it is routed towards a container instead of Lambda (the limitation here is the 15-minute Lambda execution time).
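A minimal sketch of such a task router, assuming the CodeCommit trigger wiring shown earlier and a commit-message convention of our own invention (a "[bigjob]" flag); the executor function name is a placeholder.

```python
import json

import boto3

codecommit = boto3.client("codecommit")
lambda_client = boto3.client("lambda")

def handler(event, context):
    record = event["Records"][0]
    repo = record["eventSourceARN"].split(":")[5]        # repository name is the last ARN element
    ref = record["codecommit"]["references"][0]
    branch = ref["ref"].replace("refs/heads/", "")
    commit_id = ref["commit"]

    # The commit message carries the routing hints, e.g. "[bigjob] monthly vacuum of sales_mart".
    message = codecommit.get_commit(repositoryName=repo, commitId=commit_id)["commit"]["message"]

    payload = {"repository": repo, "branch": branch, "commitId": commit_id}

    if "[bigjob]" in message.lower():
        # Long-running work goes to a container (see the Big Job sketch below), not to Lambda.
        return {"route": "fargate", **payload}

    lambda_client.invoke(
        FunctionName="redshift-sql-executor",   # hypothetical executor function name
        InvocationType="Event",                 # asynchronous hand-off
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return {"route": "lambda", **payload}
```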
Two major types of long-running executions:
ETL: COPYs, UNLOADs and CREATE TABLE AS statements, usually done by an ETL/ELT tool (Glue, Matillion, Informatica, …)
Housekeeping: VACUUM / ANALYZE. Some ETL tools are also capable of scheduling these operations.
The Big Job deployer is entirely optional in most cases, depending on what other tools are already available in-house.
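If no such tool is at hand, one option is to let the router hand the same payload to a Fargate task running the executor code in a container. A rough sketch with Boto, where the cluster, task definition, subnets and security groups are all placeholders:

```python
import boto3

ecs = boto3.client("ecs")

def run_big_job(payload):
    """Run a long job (VACUUM, a big COPY, CREATE TABLE AS) in Fargate to avoid the 15-minute Lambda cap."""
    ecs.run_task(
        cluster="dwh-deploy",                    # placeholder ECS cluster
        launchType="FARGATE",
        taskDefinition="redshift-big-job:1",     # container image bundles the same executor logic
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],     # private subnets with a route to Redshift
                "securityGroups": ["sg-0123456789abcdef0"],
                "assignPublicIp": "DISABLED",
            }
        },
        overrides={
            "containerOverrides": [{
                "name": "executor",
                "environment": [
                    {"name": "REPOSITORY", "value": payload["repository"]},
                    {"name": "BRANCH", "value": payload["branch"]},
                    {"name": "COMMIT_ID", "value": payload["commitId"]},
                ],
            }]
        },
    )
```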
A few examples of possible use cases apart from normal development work.
Anything which can commit a SQL file to a repository can utilize the framework.
Automatic, controlled, central deployment of generated scripts forward-engineered from a database modelling tool, be it a full new schema deployment or incremental deltas to the structure. No more "Oops" situations where someone has accidentally dropped a few views and broken another few on Production instead of Dev, just because they started working before the third coffee.
DBSchema, Aqua, Aginity, whatever your weapon of choice is. If the tool has Git integration, it will work seamlessly with the pipeline.
If a database team reaches a higher capability maturity level and the company can justify purchasing more complicated and potentially pricey Database Release Management software solutions, the SDLC might change again, but up until then…
What we frequently see is that more and more customers want almost real-time visibility of their AWS costs. AWS provides a neat extract mechanism which dumps the billing data into an S3 bucket, hourly if required. But good guy AWS not only dumps the raw data, it also dumps the SQL commands and manifest files for loading the raw CSVs into Redshift. A trigger on the appropriate S3 PUT can start a function which picks up the event, makes minor changes to the loader SQL file (adding the Redshift target schema, for example), and commits the modified SQL to a repository monitored by a pipeline.
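A hedged sketch of that glue function: it is triggered by the S3 PUT of the AWS-generated loader SQL, prefixes the table reference with a target schema, and commits the result to the watched repository. The repository, branch, schema name and the exact string being rewritten are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")
codecommit = boto3.client("codecommit")

REPO = "dwh-billing-load"      # hypothetical repository watched by the pipeline
BRANCH = "dev"
TARGET_SCHEMA = "billing"      # hypothetical Redshift schema for the billing tables

def handler(event, context):
    # Fired by the S3 PUT of the loader SQL that accompanies the hourly billing extract.
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]

    sql = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Minor change: qualify the generated table name with our target schema.
    # The "awsbilling" token is illustrative; adjust to whatever the generated SQL actually contains.
    sql = sql.replace(" awsbilling", f" {TARGET_SCHEMA}.awsbilling")

    parent = codecommit.get_branch(repositoryName=REPO, branchName=BRANCH)["branch"]["commitId"]
    codecommit.put_file(
        repositoryName=REPO,
        branchName=BRANCH,
        filePath=f"billing/{key.split('/')[-1]}",
        fileContent=sql.encode("utf-8"),
        parentCommitId=parent,
        commitMessage=f"Automated billing load from s3://{bucket}/{key}",
    )
```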
Similar approaches can work very well even in certain Data Lake scenarios, or if you make the loader SQL and manifest part of your interface contract between systems, a deliverable accompanying the data itself.
Trigger: for a newly pushed commit, the following information is automatically forwarded to the Lambda in the trigger event:
- Who
- When
- Which repo
- Which branch
- Commit ID
- Most of the work is done by a Lambda function, written in Python. Boto is an incredibly convenient and elegant tool to create integration between AWS services.
The executor:
- Retrieves commit details and code from CodeCommit based on the commit ID
- Retrieves additional config from DynamoDB, such as hooks for Slack or Teams, the Redshift host, and the target database and schema
- Retrieves the appropriate secrets from Secrets Manager. You will have to have a naming convention in place; a [repo-branch] combination works fine
- Executes the code against the database
- Initiates notifications: Slack hook, MS Teams hook, basically anything supporting cURL / HTTP hooks, or email. The exact setup depends on the networking setup (including Lambda networking), client preferences and existing messaging platform usage and integration capabilities
- Logs everything in CloudWatch
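Pulling those steps together, a minimal executor sketch in Python with Boto. The DynamoDB table layout, the secret naming, the fixed deploy.sql path and the use of psycopg2 (packaged in a Lambda layer) are assumptions for illustration, not the exact implementation from the talk.

```python
import json
import urllib.request

import boto3
import psycopg2   # assumed to be bundled with the deployment package or a Lambda layer

codecommit = boto3.client("codecommit")
dynamodb = boto3.resource("dynamodb")
secrets = boto3.client("secretsmanager")

CONFIG_TABLE = "redshift-cd-config"     # hypothetical DynamoDB config table

def handler(event, context):
    repo, branch, commit_id = event["repository"], event["branch"], event["commitId"]

    # 1. Per repo/branch configuration: Redshift host, target database/schema, chat hooks.
    config = dynamodb.Table(CONFIG_TABLE).get_item(Key={"repo_branch": f"{repo}-{branch}"})["Item"]

    # 2. Credentials, following the [repo-branch] naming convention for the secret.
    secret = json.loads(secrets.get_secret_value(SecretId=f"{repo}-{branch}")["SecretString"])

    # 3. The SQL to deploy. (A fuller version would diff the commit and fetch every changed file.)
    sql = codecommit.get_file(
        repositoryName=repo, commitSpecifier=commit_id, filePath="deploy.sql"
    )["fileContent"].decode("utf-8")

    # 4. Execute against Redshift as the dedicated service user.
    conn = psycopg2.connect(
        host=config["redshift_host"], port=5439,
        dbname=config["database"], user=secret["username"], password=secret["password"],
    )
    try:
        with conn, conn.cursor() as cur:
            cur.execute(f"SET search_path TO {config['schema']}")
            cur.execute(sql)
    finally:
        conn.close()

    # 5. Notify the team; anything with an HTTP hook (Slack, Teams) works.
    notification = {"text": f"Deployed {commit_id[:8]} from {repo}/{branch}"}
    urllib.request.urlopen(
        urllib.request.Request(
            config["slack_hook"],
            data=json.dumps(notification).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
    )
```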
It will be a change, especially for teams at the low end of the Capability Maturity Model, but a crucial change, and that is exactly the point!
Improved code quality. A lot of tools try to differentiate themselves with automatic code review capability, but in the real, complicated world it is not always so simple that it can be codified, otherwise DBA work would not have to be black magic! And there are other options, such as the new Redshift Recommendations, or clever monitoring of certain STL and STV views, sometimes in combination with alerting on a Kibana dashboard.
Skills: pull requests and protected branches enforce four-eye checks. The console provides easy access to relatively advanced Git features, which is important, as database development teams are traditionally a little bit behind in terms of DevOps experience.
Human-to-human knowledge transfer is built into the deployment process, which automatically encourages growth in team maturity, in individual developer skills, and in Redshift performance.
Quality: many SQL statements can be scripted in an IDEMPOTENT way, so many scripts will be re-runnable.
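For example, a deployment script written with IF NOT EXISTS / IF EXISTS guards can be executed repeatedly without harm; a hypothetical mini-deployment (object names invented) might look like this:

```python
# Each statement is safe to re-run, so a failed or repeated deployment does no damage.
IDEMPOTENT_DEPLOY = [
    "CREATE SCHEMA IF NOT EXISTS sales_mart",
    "DROP VIEW IF EXISTS sales_mart.v_daily_revenue",
    """
    CREATE VIEW sales_mart.v_daily_revenue AS
    SELECT order_date, SUM(amount) AS revenue
    FROM sales_mart.orders
    GROUP BY order_date
    """,
]

def deploy(cursor):
    # 'cursor' is any DB-API cursor connected to the target Redshift database.
    for statement in IDEMPOTENT_DEPLOY:
        cursor.execute(statement)
```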
Uptime: the main effect is a much improved, undisturbed availability of data for end-user-facing BI tools. Breaking a data mart via an incorrect VIEW definition is now much harder. This leads to increased customer satisfaction and trust in the IT team.
Overall, data security is enforced by automatic processes on every level, including auditing and traceability.
Multi-layer security is present.
VPC (this was yesterday -> re:Invent happened while I was sleeping): a closed VPC with service endpoints wherever possible (an ENI or NAT setup might be required, but a seasoned SA should breeze through these). The executor Lambda runs in the closed VPC, which has S3 and Secrets Manager endpoints, and Redshift enhanced VPC routing is turned on. CodeCommit has no VPC endpoints available yet, and in AWS China there is no CodeCommit at all. Companies with very strict security requirements, such as data (including code) not being allowed to travel on the open internet even when encrypted, still have choices, for example hosting Git on an EC2 instance within the closed VPC; triggering the executors might then require manual setup of the hooks.
IAM provides full lock-down capabilities at both the infrastructure/service and the resource level:
- Bespoke execution / resource roles for Lambda and any other service
- Bespoke CodeCommit users and groups for engineers and a senior / approver group
Directory Service works in federation with IAM for single sign-on console access, for example to facilitate pull request reviews and merges. The client controls access levels via AD groups.
Redshift: an in-database user management framework with the service users the Lambda executors are utilising, and pre-configured upstream user groups. Many of our clients DO NOT even have credentials for any Redshift user accounts with elevated privileges, such as schema owners or superusers.
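As a rough illustration of what such an in-cluster framework can be built from (object and user names invented, run once by an administrator): a dedicated service user for the executor, a read-only group for upstream consumers, and default privileges so future objects are covered automatically.

```python
# One-off bootstrap statements; afterwards day-to-day deployments run only as the service user.
BOOTSTRAP_STATEMENTS = [
    # Dedicated service user the executor Lambda connects as (real password comes from Secrets Manager).
    "CREATE USER svc_deploy_sales PASSWORD '<generated-and-stored-in-secrets-manager>'",
    # Read-only group for upstream BI tools and analysts.
    "CREATE GROUP bi_readers",
    "GRANT USAGE ON SCHEMA sales_mart TO GROUP bi_readers",
    "GRANT SELECT ON ALL TABLES IN SCHEMA sales_mart TO GROUP bi_readers",
    # Default privileges: anything the service user creates later is readable by the group automatically.
    "ALTER DEFAULT PRIVILEGES FOR USER svc_deploy_sales IN SCHEMA sales_mart "
    "GRANT SELECT ON TABLES TO GROUP bi_readers",
]
```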
Reliability: Lambda scales horizontally, with an automatic burst of 500 to 3,000 concurrent executions (the higher limit in bigger regions, such as Ireland). Scaling of CodeCommit and Secrets Manager is managed by AWS, just as on Fargate and ECS.
Operational Excellence:
- Deployment of the CD pipelines via parameterized CloudFormation templates, infrastructure as code
- Lambda: retry functionality and Dead Letter Queues, optionally AWS Step Functions for an extra layer of state management
- CloudWatch and X-Ray
- Notifications on functional DB code failures to dev teams via Slack / Teams
- Notifications and alerting on infra-level problems to SysOps teams via CloudWatch and DLQs
Performance Efficiency:
- Right-sized Lambdas and right-sized, well-configured containers for the infrastructure
- Power users use repositories which are connected to dedicated Redshift users with access to superuser / dedicated WLM queues
- DynamoDB: auto-scaling might be overkill, depending on the size of the dev teams and the branches to manage, but the main thing is that the load is measurable and the functionality is there to auto-scale if required
Cost Optimization: the beauty is that this is practically free to run once you build it; the cost is insignificant if there is any. Lambda, CodeCommit, triggering, CloudFormation, all the nice tools are made available free of charge or very cheaply. There is minimal cost associated with the "Big Job" deployer, and the SQL code itself runs on the Redshift clusters!
Also, not just for Redshift.
I believe the power of the AWS ecosystem is evident: multiple cloud-native services working perfectly in concert to create an automated, event-driven, efficient, secure and scalable solution to a challenge. AWS is the perfect place for thinking "outside the box".
Slide titles from the deck:
Next Generation Data Warehouse Development with Lambda and Redshift
Petabyte scale, Massively Parallel, Exceptionally fast, Massive storage capacity, Attractive and transparent pricing
Challenges – the cost of greatness: Audit & Compliance
Empower the Database Developer and DBA communities with DevOps
CD Pipeline for DB code
Data Modeller SDLC: Forward Engineering
Edge ETL Use Cases: AWS Billing Data Load