© 2021, Amazon Web Services, Inc. or its Affiliates.
Hung Nguyen, hunggia@amazon.com
Senior Solutions Architect
Data Lake on AWS
Part 2
DevAx Online Workshop
Agenda
• Review of Data Lakes
• Modernize Data Warehouse with Amazon Redshift
• Data Processing with Amazon EMR
• Event Driven Processing with AWS Lambda
Review of Data Lakes
Data Lakes Extend the Traditional Approach
Data Warehouse
Business Intelligence
OLTP ERP CRM LOB
• Relational and non-relational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Big Data processing,
real-time, Machine Learning
Data Lake
Data Lakes and Analytics from AWS
Cost-effective
Scalable and durable
Secure
Open and comprehensive
Analytics
Machine Learning
Real-time Data
Movement
On-premises
Data Movement
Data Lake
on AWS
Modernize Data Warehouse
Traditional architectures & on-prem data warehousing lead to dark data:
data that is collected but from which it is challenging to extract insights.
Scale
• Can’t scale easily or on-demand
• Long lead times for hardware procurement & upgrades
Cost
• High overhead costs for administration
• Cold and warm data inseparable, leading to bloated costs & wasted capacity
Anti-democratization
• Proprietary formats
• Data silos
• Need to ingest, transform data before analysis
• Limits on users and data
Legacy architecture patterns
• One size fits all approach
[Figure: data silos (sources OLTP, ERP, CRM, LOB and Devices, Web, Sensors, Social; DW Silo 1 and DW Silo 2, each with Business Intelligence) contrasted with data lakes (open formats, central catalog; machine learning, BI + analytics, data warehousing)]
Traditional architectures lead to dark data
What is the Amazon Redshift service?
• Fully Managed: automated maintenance & workload management; cost-effective cloud data warehouse
• Superior Speed: extensive machine learning based optimizations and features
• Highly-resilient: service SLA of 99.9%
• Secure: end-to-end encryption; SSO; compliance with SOC 1/2/3, HIPAA, FedRAMP & more
• Scale: query GBs to EBs; auto scaling; independent compute/storage scaling
• Highly-rated & Most Popular: tens of thousands of deployments; highly rated by agencies such as Gartner
• RDS, ML, and Data Lake Integration: query data in-place in your data lake and RDS; ACID and ANSI SQL
• Commitment to Open Formats: query open formats in place; import/export Parquet & CSV; query ORC, Avro, JSON, …
Not a standalone data warehouse, but a data warehouse
that breaks down the silos to keep data “free”
BI Reporting Analytics
Typical use cases
Redshift cluster architecture
• Leader node
  • SQL endpoint
  • Stores metadata
  • Coordinates parallel SQL processing & ML optimizations
  • Leader node is no-charge for clusters with 2+ nodes
• Compute nodes
  • Local, columnar storage
  • Executes queries in parallel
  • Load, unload, backup, restore from S3
• Amazon Redshift Spectrum nodes
  • Execute queries directly against data lake
• Massively parallel, shared nothing architecture
[Diagram: SQL clients / BI tools connect over JDBC/ODBC to the leader node, which sits in front of compute nodes 1 to N; the compute nodes load, unload, back up, and restore against Amazon S3 (exabyte-scale object storage) and use Redshift Managed Storage; Redshift Spectrum loads and queries data in Amazon S3]
• A compute node is partitioned into slices
• A slice can be thought of as a “virtual compute node”
• Each slice is allocated a portion of the compute node's
memory and disk space, where it processes a portion of
the workload assigned to the compute node by the
leader node
• The leader node manages distributing data to the slices and
apportions the workload for any queries or other database
operations to the slices
• Slices are Redshift’s Symmetric Multiprocessing (SMP)
mechanism – they work in parallel to complete
operations
Compute Node
Compute node: Under the Hood (Slices)
Additional documentation: Amazon Redshift system architecture
Advanced
Note: Redshift supports "data slice" and "compute slice"; soon, the compute slice will be able to write as well.
Note: RAM size does not
indicate actual size
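The slice mechanics described on this slide can be sketched in a few lines of Python. This is only an illustration: the hash function (`zlib.crc32`), the helper names `assign_slice` and `distribute`, and the sample rows are assumptions made for the sketch, not Redshift's actual distribution algorithm.

```python
import zlib
from collections import defaultdict


def assign_slice(key: str, num_slices: int) -> int:
    """Map a distribution-key value to a slice index (stand-in hash, not Redshift's)."""
    return zlib.crc32(key.encode("utf-8")) % num_slices


def distribute(rows, key_field, nodes, slices_per_node):
    """Group rows by (node, slice), mimicking how a leader node hands out data."""
    total_slices = nodes * slices_per_node
    placement = defaultdict(list)
    for row in rows:
        s = assign_slice(str(row[key_field]), total_slices)
        placement[(s // slices_per_node, s % slices_per_node)].append(row)
    return placement


# Hypothetical table: 1,000 rows distributed on customer_id.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
placement = distribute(rows, "customer_id", nodes=2, slices_per_node=4)

# Each slice aggregates its own portion; the leader combines the partial results.
per_slice_totals = {k: sum(r["amount"] for r in v) for k, v in placement.items()}
grand_total = sum(per_slice_totals.values())
print(grand_total)
```

Because every row lands on exactly one slice, the combined partial sums equal the sum over the whole table, which is the property that lets slices work in parallel.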
A Redshift cluster can have up to 128 dc2.8xlarge or
RA3.16xlarge nodes (i.e. 326 TB or 16 PB of local or managed
storage, respectively) & can support EBs of data with its
Redshift Modern Data Architecture approach
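The capacity figures above follow from multiplying the node limit by per-node storage. A quick check in Python, using the per-node sizes from the instance table on this slide (the variable names are illustrative):

```python
MAX_NODES = 128                 # cluster node limit quoted on the slide
DC2_8XLARGE_TB = 2.56           # local SSD per dc2.8xlarge node
RA3_16XLARGE_RMS_TB = 128       # managed storage per ra3.16xlarge node

dc2_total_tb = round(MAX_NODES * DC2_8XLARGE_TB, 2)
ra3_total_tb = MAX_NODES * RA3_16XLARGE_RMS_TB
ra3_total_pb = ra3_total_tb / 1024

print(dc2_total_tb, "TB of local storage with DC2")
print(ra3_total_pb, "PB of managed storage with RA3")
```

The RA3 product matches the 16 PB on the slide. The DC2 product comes out slightly above the quoted 326 TB; the slide does not say how its figure was rounded or derived.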
Amazon Redshift RA3 (current generation)
• Solid-state disks + Amazon S3
• Amazon Redshift Managed Storage (RMS)
Dense compute - DC2
• Solid-state disks
Redshift instance types
Instance type   Disk type   Size               Memory    # CPUs   # Slices
RA3 (New)
RA3 xlplus      RMS         Scales to 32 TB    32 GiB    4        2
RA3 4xlarge     RMS         Scales to 128 TB   96 GiB    12       4
RA3 16xlarge    RMS         Scales to 128 TB   384 GiB   48       16
Compute Optimized
DC2 large       SSD         160 GB             16 GiB    2        2
DC2 8xlarge     SSD         2.56 TB            244 GiB   32       16
Dense storage - DS2 (legacy)
• Magnetic disks
Additional documentation: Working with clusters
Integrate with Data Lake
• Customers are increasingly moving to data lake
architectures
• Amazon Redshift allows you to extend your data
warehouse to your data lake - a Modern Data
Architecture
• Flexibility to store highly structured, frequently
accessed data in Redshift and keep infrequently
used data in S3
• Query seamlessly across both to provide unique
insights
Amazon Redshift is the only data warehouse that
extends your queries to your Amazon S3 data lake
without moving data
Redshift Modern Data Architecture
[Diagram: a data lake surrounded by non-relational databases, machine learning, data warehousing, log analytics, big data processing, and relational databases]
Customers moving to data lake architectures
Redshift enables you to have a modern data architecture
Amazon Redshift Federated Query
[Diagram: clients connect over JDBC/ODBC to Amazon Redshift, which queries Amazon RDS (PostgreSQL, MySQL), Amazon Aurora (PostgreSQL, MySQL), and the Amazon S3 data lake]
Query and join data from one or more
Amazon RDS and Aurora PostgreSQL databases
Analytics on operational data without data
movement and ETL delays
Use case: Integrate operational data with DW and
data lake for real-time analytics
Intelligent distribution of computation to remote
sources to optimize performance
Flexible and easy way to ingest data avoiding
complex ETL pipelines
Amazon RDS and Aurora MySQL support
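As a rough illustration of the points above, the Python sketch below builds the SQL that would register an RDS/Aurora PostgreSQL database as an external schema and then join it with a local warehouse table. The function names, the schema, table, and column names, the endpoint, and the ARNs are all hypothetical placeholders; the strings are built without escaping, so treat this as a sketch rather than production code.

```python
def federated_schema_sql(schema, pg_database, pg_schema, endpoint,
                         iam_role_arn, secret_arn, port=5432):
    """Build a CREATE EXTERNAL SCHEMA statement for a PostgreSQL source."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        "FROM POSTGRES\n"
        f"DATABASE '{pg_database}' SCHEMA '{pg_schema}'\n"
        f"URI '{endpoint}' PORT {port}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )


def join_sql(schema):
    """Join live operational rows with a (hypothetical) local warehouse table."""
    return (
        "SELECT o.order_id, o.status, s.total_amount\n"
        f"FROM {schema}.orders AS o\n"
        "JOIN sales_summary AS s ON s.order_id = o.order_id\n"
        "WHERE o.status = 'OPEN';"
    )


ddl = federated_schema_sql(
    schema="ops",                                   # hypothetical names throughout
    pg_database="shop",
    pg_schema="public",
    endpoint="example-db.cluster-xyz.us-east-1.rds.amazonaws.com",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example",
)
query = join_sql("ops")
print(ddl)
print(query)
```

The statements would be run through any SQL client connected to the cluster; the filter on `o.status` is the kind of predicate Redshift can push to the remote source.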
Run SQL queries directly against data in S3 using
thousands of nodes
Redshift Spectrum is a feature of Redshift that allows
SQL queries on external data stored in Amazon S3
Benefits
• Modern Data Architecture enables you to query exabytes of data
in an S3 data lake
• Data is queried in-place, no loading of data
• Keeps your data warehouse lean by ingesting warm data
locally while keeping other data in the data lake within reach
• Write query results from Redshift directly to S3 external tables
• Powered by a separate fleet of powerful Amazon Redshift
Spectrum nodes
• Create materialized views on S3 data using Redshift
Spectrum queries
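To make the "query in place" idea concrete, here is a minimal Python sketch that assembles the three SQL statements typically involved: register an external schema backed by the data catalog, define an external table over Parquet files in S3, and query it. The helper name, schema, table, columns, bucket path, and role ARN are hypothetical, and the strings are not escaped, so this is illustrative only.

```python
def spectrum_statements(schema, catalog_db, iam_role_arn, table, s3_location):
    """Return DDL and a query for reading S3 data through Redshift Spectrum."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{catalog_db}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )
    create_table = (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        "  event_id BIGINT,\n"
        "  event_time TIMESTAMP,\n"
        "  payload VARCHAR(1024)\n"
        ")\n"
        "STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )
    # The data stays in S3; only the query result comes back to the cluster.
    query = f"SELECT COUNT(*) FROM {schema}.{table};"
    return [create_schema, create_table, query]


statements = spectrum_statements(
    schema="lake",                                   # hypothetical names throughout
    catalog_db="example_catalog_db",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleSpectrumRole",
    table="events",
    s3_location="s3://example-bucket/events/",
)
for sql in statements:
    print(sql, end="\n\n")
```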
Spectrum
Redshift Spectrum Overview
Standards, formats, and open source
• Apache Flink
• Ganglia
• Apache HBase
• HCatalog
• Hadoop Distributed File System (HDFS)
• Apache Hive
• Hudi
• Java
• JupyterHub
• Apache Kafka
• Apache Livy
• Apache Mahout
• MapReduce
• Apache MXNet
• MySQL
• Apache Oozie
• Apache ORC
• Apache Parquet
• Phoenix
• Apache Pig
• Presto
• Python
• PyTorch
• R
• Scala
• Apache Spark
• Sqoop
• SQL
• TensorFlow
• Tez
• Yarn
• Apache Zeppelin
• Apache Zookeeper
…and many more
AWS alternatives to open source
• Amazon EMR (Hadoop, Spark): Spark, Hive, Presto, Flink, HBase
• Amazon ES (operational analytics): Elasticsearch, Logstash, Kibana
• Managed Streaming for Apache Kafka (real-time analytics): Kafka
Amazon EMR
Easily run Spark, Hadoop, Hive, Presto, HBase, and other big data frameworks
• Automate provisioning, configuring, and tuning: easy setup, management, and monitoring, with latest open-source framework updates within 30 days
• Run workloads faster and more cost-effectively: 1.7x faster than standard Apache Spark 3.0 at 40% of the cost, and 2.6x faster than open-source Presto 0.238 at 80% of the cost
• Automatically scale up and down: manage cluster size based on utilization to reduce costs
• Simple and predictable pricing: per-second pricing, and save 50%–80% with Amazon EC2 Spot and Reserved Instances
Understanding Cluster and Nodes
• The central component of Amazon EMR is
the cluster. A cluster is a collection of
Amazon Elastic Compute Cloud (Amazon
EC2) instances.
• Each instance in the cluster is called a node.
• Each node has a role within the cluster,
referred to as the node type.
• Amazon EMR also installs different
software components on each node type,
giving each node a role in a distributed
application like Apache Hadoop.
EMR Node Types
• Master instance group: the master node must keep running
• Core instance group (HDFS): core nodes can be added and removed gracefully
• Task instance group: the cluster can tolerate loss of task nodes
Instance fleets for advanced Spot provisioning
Master Node, Core Instance Fleet, Task Instance Fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the optimal Availability Zone based on capacity/price
Cluster Types
Persistent Cluster – Interactive Queries
(Spark-SQL | Presto)
Transient Cluster - Batch Jobs
(X hours nightly) – Add/Remove Nodes
Amazon S3
Managed Scaling
• EMR Managed Scaling Capacity
• Control of Upper and Lower limits
• Control of Split between On-Demand and Spot instances
• Control of Split between Core and Task Nodes
• Amazon EMR versions 5.30.0 and later (except for Amazon EMR 6.0.0)
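The limits listed above map onto a small policy document. The sketch below builds one as a plain Python dictionary; the helper name and the numbers are illustrative assumptions. The field names follow the EMR managed scaling `ComputeLimits` structure, and a dictionary of this shape is what one would pass as `ManagedScalingPolicy` to the boto3 EMR client's `put_managed_scaling_policy` call (not invoked here, since that needs a real cluster and credentials).

```python
def managed_scaling_policy(min_units, max_units, max_on_demand=None, max_core=None):
    """Build a managed scaling policy dict with upper/lower limits and optional splits."""
    if not 1 <= min_units <= max_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    limits = {
        "UnitType": "Instances",
        "MinimumCapacityUnits": min_units,   # lower limit
        "MaximumCapacityUnits": max_units,   # upper limit
    }
    if max_on_demand is not None:
        # Capacity above this value is provisioned as Spot.
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand
    if max_core is not None:
        # Capacity above this value is provisioned as task nodes.
        limits["MaximumCoreCapacityUnits"] = max_core
    return {"ComputeLimits": limits}


# Hypothetical cluster: 2 to 20 instances, at most 5 On-Demand and 4 core nodes.
policy = managed_scaling_policy(2, 20, max_on_demand=5, max_core=4)
print(policy)
```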
Data Lake query services: How to choose?
Amazon Redshift
• Data warehouse, highly-relational, complex joins
• Modern data architecture approach
• Sub-second latency
• Joins between data warehouse data & an S3 data lake
Amazon Athena
• Interactive ad-hoc queries
• Serverless
• No data warehouse, not 24x7
• Log analysis
• Offload S3 workload from data warehouse
Amazon EMR
• Process large volume of data
• Use big data tools like Apache Hadoop, Spark, Presto, Hive
• Run Jupyter-based EMR notebooks
[Diagram: a data lake surrounded by non-relational databases, machine learning, data warehousing, log analytics, big data processing, and relational databases]
AWS Lambda
Serverless Architecture
• Event Source: changes in data state, requests to endpoints, changes in resource state
• Function: Node.js, Python, Java, C#, Go, Ruby, Bring Your Own
• Services / Other
Anatomy of a Lambda Function
Handler function
• Function executed on invocation
• Processes incoming event
Event
• Invocation data sent to function
• Shape differs by event source
Context
• Additional information from Lambda service
• Examples: request ID, time remaining
def handler(event, context):
    # Build a greeting from the "name" field of the incoming event
    msg = 'Hello {}'.format(
        event['name']
    )
    return {'message': msg}
app.py
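A handler like the one on this slide can be exercised locally before deployment, since it is an ordinary Python function. The sketch below repeats the handler and calls it with a sample event. `FakeContext` is a hypothetical stand-in for the context object Lambda passes in; it mimics only the two items the slide mentions (request ID and time remaining).

```python
def handler(event, context):
    # Build a greeting from the "name" field of the incoming event
    msg = 'Hello {}'.format(event['name'])
    return {'message': msg}


class FakeContext:
    """Minimal local stand-in for the Lambda context (illustrative only)."""
    aws_request_id = "local-test"

    def get_remaining_time_in_millis(self):
        return 3000


# The event shape is up to the caller here; real event sources define their own.
result = handler({'name': 'DevAx'}, FakeContext())
print(result)
```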
Lambda Function Configuration
Power Rating
• Select between 128 MB and 10 GB
• CPU and network allocated
proportionally
• Power tune to balance cost and
speed
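The cost and speed trade-off can be reasoned about in GB-seconds, the unit Lambda compute is billed in (memory multiplied by duration). The sketch below compares two configurations; the durations are made-up numbers for illustration, not measurements, and the helper name is an assumption.

```python
def gb_seconds(memory_mb, duration_ms):
    """Compute consumed GB-seconds for one invocation."""
    return (memory_mb / 1024) * (duration_ms / 1000)


# Hypothetical function: 8 s at 128 MB vs. 1 s at 1024 MB.
small = gb_seconds(128, 8000)
large = gb_seconds(1024, 1000)
print(small, large)
```

In this made-up case both settings consume the same GB-seconds, so the larger setting would finish eight times sooner for the same compute charge. Whether a real function speeds up proportionally has to be measured.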
Permissions Model
• Execution Role grants function
access to resources via IAM
• Function Policy controls
invocation
Lambda ideal usage pattern for data analytics
• Real-time file processing
• Real-time stream processing
• Extract, transform, load (ETL)
• Replace Cron
• Process AWS Events
• Datalake Data Serving
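As one example of the real-time file processing pattern, the sketch below shows a handler that reacts to an S3 object-created notification by extracting the bucket and key of each new file. The sample event is trimmed to the fields the handler reads, the bucket and key values are hypothetical, and the "processing" step is left as a placeholder.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Collect the bucket/key of every object referenced in an S3 notification."""
    objects = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Keys arrive URL-encoded in S3 notifications (spaces become '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"].get("size", 0)
        # Placeholder: a real function would read and transform the object here.
        objects.append({"bucket": bucket, "key": key, "size": size})
    return {"processed": len(objects), "objects": objects}


sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-raw-bucket"},
                "object": {"key": "raw/2021/my+file.csv", "size": 1024},
            }
        }
    ]
}
result = handler(sample_event, None)
print(result)
```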
Analytics on AWS
Thank you!

Module 2 - Datalake
