AWS Summit EMEA
Claudio Pontili
AWS Senior Cloud Solution Architect
Claudio.pontili@besharp.it
A Serverless approach to Big Data on AWS
Claudio Pontili
10+ years of experience on AWS
Senior Cloud Solution Architect
AWS Authorized Instructor Champion
Claudio.pontili@besharp.it
https://www.linkedin.com/in/claudiopontili/
Agenda
• Using Lambda for ETLs
• Glue ETLs
• CI/CD to deploy code inside Lambdas and Glue Jobs
• Data warehousing on Aurora Serverless v1
• A full serverless Big Data Architecture
• What we’ve learned
Using Lambda for ETLs 1/2
• You can use Python + the pandas library
• A Lambda function can have up to 10 GB of memory and a lot of CPU
power
• A Lambda function can run for up to 15 minutes
• Max deployment package 50 MB (zipped)
• Container image code package size 10 GB
• /tmp directory storage 512 MB
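A minimal sketch of what a small Lambda ETL might look like under the limits listed above. It is not taken from the talk: the bucket name, the `clean/` prefix and the cleaning rule are hypothetical, and the transform uses the standard-library `csv` module to stay dependency-free (in a real function, pandas could take over the transform step). `boto3` is imported inside the handler, so the transform can be tested locally without AWS credentials.

```python
import csv
import io
import os
import urllib.parse

# Hypothetical destination bucket, read from the function's environment.
OUTPUT_BUCKET = os.environ.get("OUTPUT_BUCKET", "example-output-bucket")


def transform_csv(raw_text: str) -> str:
    """Lower-case the header names and drop rows with empty or extra fields."""
    reader = csv.DictReader(io.StringIO(raw_text))
    source_names = reader.fieldnames or []
    fieldnames = [name.strip().lower() for name in source_names]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if None in row:  # more columns than the header: treat as malformed
            continue
        values = [(row[name] or "").strip() for name in source_names]
        if all(values):
            writer.writerow(dict(zip(fieldnames, values)))
    return out.getvalue()


def handler(event, context):
    """Entry point for an S3 'object created' notification (assumed trigger)."""
    import boto3  # imported lazily; only needed when running in AWS

    s3 = boto3.client("s3")
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # The whole object is held in memory, so it must fit the function's RAM.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    cleaned = transform_csv(body)

    output_key = f"clean/{key}"
    s3.put_object(Bucket=OUTPUT_BUCKET, Key=output_key, Body=cleaned.encode("utf-8"))
    return {"output_key": output_key}
```

Inputs that cannot be processed within 15 minutes or held in memory are better candidates for a Glue job.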
Using Lambda for ETLs 2/2
Glue Jobs, Data Catalog and Crawler 1/2
• Fully managed Data Catalog and Extract-Transform-Load (ETL) service
• Automates data discovery, conversion, mapping and job scheduling
• Glue runs your ETL jobs in a serverless Apache Spark environment
• Allows you to scale your ETL jobs
• You can easily schedule a crawler to create a catalog of files stored on
S3
• Too much code? Try Glue DataBrew
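Scheduling a crawler over files on S3, as mentioned above, could be sketched with boto3 roughly as follows. The crawler name, IAM role ARN, database name, S3 path and cron expression are all placeholders, not values from the talk. The request is built in a separate function so it can be inspected without calling AWS.

```python
def build_crawler_request(name, role_arn, database, s3_path, cron):
    """Build the keyword arguments for the Glue create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        # Glue schedules use the cron(...) syntax with six fields.
        "Schedule": f"cron({cron})",
    }


def create_scheduled_crawler(request):
    """Create the crawler in the current AWS account and region."""
    import boto3  # imported lazily; requires AWS credentials at call time

    glue = boto3.client("glue")
    glue.create_crawler(**request)


# Hypothetical example: crawl the raw/ prefix every night at 02:00 UTC.
example_request = build_crawler_request(
    name="raw-data-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="raw",
    s3_path="s3://example-bucket/raw/",
    cron="0 2 * * ? *",
)
```

In line with the IaC advice later in the deck, the same crawler would more likely be declared in CloudFormation, Terraform or CDK than created by a script.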
Glue DataBrew
Glue Jobs, Data Catalog and Crawler 2/2
A serverless CI/CD
RDS Aurora Serverless
• MySQL and PostgreSQL supported (reuse
the experience of your team)
• Pay per ACU-hour (1 ACU = 2 GB of memory)
• Scales from 1 to 256 ACUs
• You can pause the cluster during the night
• Aurora Serverless v2 (in preview) adds
Multi-AZ, read replicas and faster scaling
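The capacity range and the pause behaviour described above map onto the `ScalingConfiguration` of an Aurora Serverless v1 cluster. The sketch below is an illustration, not the speaker's setup: the cluster identifier and the one-hour idle timeout are made up, and the list of valid capacities is the MySQL-compatible one.

```python
# ACU values accepted by Aurora Serverless v1 (MySQL-compatible edition).
VALID_ACU = (1, 2, 4, 8, 16, 32, 64, 128, 256)


def build_scaling_configuration(min_acu, max_acu, pause_after_seconds=None):
    """Return a ScalingConfiguration dict; auto-pause is on if a timeout is given."""
    if min_acu not in VALID_ACU or max_acu not in VALID_ACU:
        raise ValueError(f"capacity must be one of {VALID_ACU}")
    if min_acu > max_acu:
        raise ValueError("min_acu must not exceed max_acu")
    config = {
        "MinCapacity": min_acu,
        "MaxCapacity": max_acu,
        "AutoPause": pause_after_seconds is not None,
    }
    if pause_after_seconds is not None:
        config["SecondsUntilAutoPause"] = pause_after_seconds
    return config


def apply_scaling(cluster_id, config):
    """Apply the configuration to an existing Serverless v1 cluster."""
    import boto3  # imported lazily; requires AWS credentials at call time

    rds = boto3.client("rds")
    rds.modify_db_cluster(
        DBClusterIdentifier=cluster_id,
        ScalingConfiguration=config,
    )


# Hypothetical example: 1 to 16 ACUs, pause after one hour without connections.
night_friendly = build_scaling_configuration(1, 16, pause_after_seconds=3600)
```

With auto-pause enabled, the cluster stops accruing ACU-hours while idle, which is how the "pause during the night" saving is obtained without a scheduled job.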
Glue Jobs, Data Catalog and Crawler 2/2
What we’ve learned
• Serverless gives you High Availability and
great scalability with no effort
• Pause Aurora Serverless v1 (it will take
about 30 seconds to restart)
• Use IaC (CloudFormation, Terraform, CDK,
etc.) to deploy your infrastructure
• Tune your Lambda memory using
https://github.com/alexcasalboni/aws-lambda-power-tuning
• S3 is cheap but try not to write tiny
(< 128 KB) files
• Serverless can be pretty cheap if it’s used
in the right way
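One way to follow the advice about tiny S3 files is to group small outputs into larger batches before writing them. The sketch below only plans the grouping; the 128 KB threshold comes from the bullet above, while the function name and the `(key, size)` input shape are assumptions made for illustration.

```python
MIN_OBJECT_BYTES = 128 * 1024  # threshold taken from the slide


def plan_batches(objects, target_bytes=MIN_OBJECT_BYTES):
    """Group (key, size_in_bytes) pairs, in order, into batches whose total
    size reaches at least target_bytes. The final batch may be smaller."""
    batches = []
    current = []
    current_size = 0
    for key, size in objects:
        current.append(key)
        current_size += size
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)  # leftover that did not reach the target
    return batches


# Hypothetical listing of small files waiting to be merged.
pending = [
    ("part-a.csv", 50 * 1024),
    ("part-b.csv", 60 * 1024),
    ("part-c.csv", 30 * 1024),
    ("part-d.csv", 200 * 1024),
    ("part-e.csv", 10 * 1024),
]
plan = plan_batches(pending)
```

Each batch would then be concatenated and uploaded as a single object, so that most objects end up at or above the threshold.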
Questions?
www.besharp.it
info@besharp.it
+39 0382 1692920
beSharp srl - viale Ludovico il Moro 27 - 27100 Pavia (ITALY)
VAT ID IT02415160189
