3. Claudio Pontili
10+ years of experience on AWS
Senior Cloud Solution Architect
AWS Authorized Instructor Champion
Claudio.pontili@besharp.it
https://www.linkedin.com/in/claudiopontili/
4. Agenda
• Using Lambda for ETLs
• Glue ETLs
• CI/CD to deploy code inside Lambdas and Glue Jobs
• Data warehousing on Aurora Serverless v1
• A full serverless Big Data Architecture
• What we’ve learned
5. Using Lambda for ETLs 1/2
• You can use Python with the Pandas library
• A Lambda function can have up to 10 GB of memory (CPU power scales with memory)
• A Lambda function can run for up to 15 minutes
• Max deployment package size: 50 MB (zipped)
• Container image code package size: up to 10 GB
• /tmp directory storage: 512 MB
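Within the limits above, a small-to-medium ETL fits directly in a handler. A minimal sketch with Pandas (bucket, key, and column names are illustrative; boto3 is preinstalled in the Lambda Python runtime):

```python
import io

import pandas as pd


def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative transformation: drop incomplete rows, derive a column."""
    df = df.dropna()
    df["total"] = df["quantity"] * df["unit_price"]
    return df


def handler(event, context):
    import boto3  # bundled with the AWS Lambda Python runtime
    # Hypothetical bucket/key passed by the triggering event
    bucket, key = event["bucket"], event["key"]
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    result = transform(pd.read_csv(io.BytesIO(body)))
    # Stage the output in /tmp (512 MB available) before uploading it back
    result.to_csv("/tmp/out.csv", index=False)
    s3.upload_file("/tmp/out.csv", bucket, f"processed/{key}")
    return {"rows": len(result)}
```

If the input outgrows the 15-minute limit or /tmp, that is the signal to move the job to Glue (next slides).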
7. Glue Jobs, Data Catalog and Crawler 1/2
• Fully managed Data Catalog and Extract-Transform-Load (ETL) service
• Automates data discovery, conversion, mapping, and job scheduling
• Glue runs your ETL jobs in a serverless Apache Spark environment
• Allows you to scale your ETL jobs
• Can easily schedule a crawler to create a catalog of files stored on S3
• Too much code? Try Glue DataBrew
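A typical Glue job reads a table the crawler registered in the Data Catalog, remaps fields, and writes the result back to S3. A skeleton of such a script (it runs only inside the Glue Spark environment; the database, table, and bucket names are illustrative):

```python
# Minimal Glue ETL script skeleton -- runs inside Glue, not locally
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read a table the crawler created in the Data Catalog
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders")

# Rename/cast fields, then write the result back to S3 as Parquet
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("order_id", "string", "order_id", "string"),
              ("amount", "string", "amount", "double")])
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/"},
    format="parquet")
job.commit()
```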
11. RDS Aurora Serverless
• MySQL and PostgreSQL supported (reuse your team's experience)
• Pay per ACU-hour (1 ACU = 2 GB of memory)
• Scales from 1 to 256 ACUs
• You can pause the cluster during the night
• Aurora Serverless v2 in preview, with Multi-AZ, read replicas, and faster scaling
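The pause-at-night behavior is a built-in auto-pause setting on the cluster's scaling configuration. A CloudFormation fragment sketching it (engine version, capacity bounds, and the omitted credentials properties are illustrative):

```yaml
AuroraCluster:
  Type: AWS::RDS::DBCluster
  Properties:
    Engine: aurora-postgresql
    EngineMode: serverless        # Aurora Serverless v1
    ScalingConfiguration:
      MinCapacity: 2
      MaxCapacity: 16
      AutoPause: true
      SecondsUntilAutoPause: 1800 # pause after 30 minutes of inactivity
```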
13. What we’ve learned
• Serverless gives you high availability and great scalability with no effort
• Pause Aurora Serverless v1 when idle (it takes about 30 seconds to restart)
• Use IaC (CloudFormation, Terraform, CDK, etc.) to deploy your infrastructure
• Tune your Lambda memory using https://github.com/alexcasalboni/aws-lambda-power-tuning
• S3 is cheap, but try not to write tiny (<128 KB) files
• Serverless can be pretty cheap if it's used in the right way
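One way to follow the tiny-files advice is to buffer small record batches and upload a single larger object instead of many small ones. A minimal sketch with Pandas (the function name and batch shapes are illustrative):

```python
import io

import pandas as pd


def coalesce_to_csv_bytes(parts: list) -> bytes:
    """Merge many small record batches into one CSV payload so a single
    larger object is uploaded to S3 instead of many tiny (<128 KB) ones."""
    combined = pd.concat(parts, ignore_index=True)
    buf = io.StringIO()
    combined.to_csv(buf, index=False)
    return buf.getvalue().encode("utf-8")
```

The resulting bytes can be uploaded with a single `put_object` call, which also saves on S3 PUT request costs.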