Module 2 - Datalake

© 2021, Amazon Web Services, Inc. or its Affiliates.
Hung Nguyen, hunggia@amazon.com
Senior Solutions Architect
Data Lake on AWS
Part 2
DevAx Online Workshop

Agenda
• Review of data lakes
• Modernize Data Warehouse with Amazon Redshift
• Data Processing with Amazon EMR
• Event Driven Processing with AWS Lambda

Review Datalake

Data Lakes Extend the Traditional Approach
• Relational and non-relational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
[Diagram: OLTP, ERP, CRM, and LOB sources feed the data warehouse and business intelligence; devices, web, sensors, and social sources feed the data lake, which supports big data processing, real-time analytics, and machine learning]

Data Lakes and Analytics from AWS
• Cost-effective
• Scalable and durable
• Secure
• Open and comprehensive
[Diagram: the data lake on AWS at the center, surrounded by analytics, machine learning, real-time data movement, and on-premises data movement]

Modernize Data Warehouse

Traditional architectures & on-prem data warehousing lead to dark data: data that is collected but is difficult to extract insights from.
Scale
• Can’t scale easily or on demand
• Long lead times for hardware procurement & upgrades
Cost
• High overhead costs for administration
• Cold and warm data are inseparable, leading to bloated costs & wasted capacity
Anti-democratization
• Proprietary formats
• Data silos
• Need to ingest, transform data before analysis
• Limits on users and data
Legacy architecture patterns
• One-size-fits-all approach
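[Diagram: from data silos to data lakes. Before: OLTP, ERP, CRM, and LOB sources feed DW Silo 1, and devices, web, sensors, and social sources feed DW Silo 2, each with its own business intelligence. After: a data lake with open formats and a central catalog serves machine learning, BI + analytics, and data warehousing]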
Traditional architectures lead to dark data

What is the Amazon Redshift service?
• Fully Managed: automated maintenance & workload management; cost-effective cloud data warehouse
• Superior Speed: extensive machine-learning-based optimizations and features
• Highly Resilient: service SLA of 99.9%
• Secure: end-to-end encryption; SSO; compliance with SOC 1/2/3, HIPAA, FedRAMP & more
• Scale: query GBs to EBs; auto scaling; independent compute/storage scaling
• Highly Rated & Most Popular: tens of thousands of deployments; highly rated by analyst firms such as Gartner
• RDS, ML, and Data Lake Integration: query data in place in your data lake and RDS; ACID and ANSI SQL
• Commitment to Open Formats: query open formats in place; import/export Parquet & CSV; query ORC, Avro, JSON, …
Not a standalone data warehouse, but a data warehouse
that breaks down the silos to keep data “free”

Typical use cases
BI, reporting, analytics

Redshift cluster architecture
• Leader node
  • SQL endpoint
  • Stores metadata
  • Coordinates parallel SQL processing & ML optimizations
  • Leader node is no-charge for clusters with 2+ nodes
• Compute nodes
  • Local, columnar storage
  • Execute queries in parallel
  • Load, unload, backup, restore from S3
• Amazon Redshift Spectrum nodes
  • Execute queries directly against the data lake
• Massively parallel, shared-nothing architecture
[Diagram: SQL clients / BI tools connect over JDBC/ODBC to the leader node, which coordinates compute nodes 1 to N backed by Redshift Managed Storage; compute nodes load, unload, back up, and restore against Amazon S3 (exabyte-scale object storage), and Redshift Spectrum queries data in S3 directly]

Compute node: Under the Hood (Slices)
• A compute node is partitioned into slices
• A slice can be thought of as a “virtual compute node”
• Each slice is allocated a portion of the compute node's memory and disk space, where it processes a portion of the workload assigned to the compute node by the leader node
• The leader node manages distributing data to the slices and apportions the workload for any queries or other database operations to the slices
• Slices are Redshift’s Symmetric Multiprocessing (SMP) mechanism: they work in parallel to complete operations
Additional documentation: Amazon Redshift system architecture
Note: Redshift supports "data slice" and "compute slice",
and soon, the compute slice can write as well.
Note: RAM size does not
indicate actual size
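
The way work is divided across slices can be pictured with a small simulation. This is only an illustrative sketch: it assumes rows are placed on slices by hashing a distribution key, and the helper names (`assign_slice`, `distribute_rows`) and the MD5-based hash are hypothetical stand-ins, not Redshift's internal algorithm.

```python
import hashlib
from collections import defaultdict


def assign_slice(dist_key: str, num_slices: int) -> int:
    """Map a distribution-key value to a slice index (illustrative hash, not Redshift's)."""
    digest = hashlib.md5(dist_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_slices


def distribute_rows(rows, key_field, nodes, slices_per_node):
    """Group rows by (node, slice), mimicking how a leader node apportions data."""
    total_slices = nodes * slices_per_node
    placement = defaultdict(list)
    for row in rows:
        slice_id = assign_slice(str(row[key_field]), total_slices)
        node_id = slice_id // slices_per_node  # consecutive slices share a node
        placement[(node_id, slice_id)].append(row)
    return dict(placement)


# Hypothetical data set: 1,000 rows keyed by customer_id.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
placement = distribute_rows(rows, "customer_id", nodes=2, slices_per_node=4)

# Each slice would then process only its own share of the rows, in parallel.
for (node_id, slice_id), chunk in sorted(placement.items()):
    print(f"node {node_id} slice {slice_id}: {len(chunk)} rows")
```

A key with many distinct values spreads rows fairly evenly; a key with few distinct values would leave some slices idle, which is why the choice of distribution key matters.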

Redshift instance types
A Redshift cluster can have up to 128 dc2.8xlarge or ra3.16xlarge nodes (i.e. 326 TB of local storage or 16 PB of managed storage, respectively) & can support EBs of data with its Redshift Modern Data Architecture approach.
Amazon Redshift RA3 (current generation)
• Solid-state disks + Amazon S3
• Amazon Redshift Managed Storage (RMS)
Dense compute - DC2
• Solid-state disks
Dense storage - DS2 (legacy)
• Magnetic disks

Instance type       Disk type   Size               Memory    # CPUs   # Slices
RA3 (new)
RA3 xlplus          RMS         Scales to 32 TB    32 GiB    4        2
RA3 4xlarge         RMS         Scales to 128 TB   96 GiB    12       4
RA3 16xlarge        RMS         Scales to 128 TB   384 GiB   48       16
Compute optimized
DC2 large           SSD         160 GB             16 GiB    2        2
DC2 8xlarge         SSD         2.56 TB            244 GiB   32       16

Additional documentation: Working with clusters
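
The cluster-level figures follow from multiplying the per-node values by the node count. A minimal sketch of that arithmetic, using the per-node sizes and slice counts from the table above (the `cluster_totals` helper is hypothetical, and decimal units of 1 TB = 1,000 GB are assumed):

```python
# Per-node figures copied from the instance-type table; sizes are in GB.
NODE_TYPES = {
    "ra3.xlplus":   {"storage_gb": 32_000,  "slices": 2},
    "ra3.4xlarge":  {"storage_gb": 128_000, "slices": 4},
    "ra3.16xlarge": {"storage_gb": 128_000, "slices": 16},
    "dc2.large":    {"storage_gb": 160,     "slices": 2},
    "dc2.8xlarge":  {"storage_gb": 2_560,   "slices": 16},
}


def cluster_totals(node_type: str, node_count: int) -> dict:
    """Return total storage (TB) and total slices for a cluster of identical nodes."""
    spec = NODE_TYPES[node_type]
    return {
        "storage_tb": spec["storage_gb"] * node_count / 1000,
        "slices": spec["slices"] * node_count,
    }


# 128 nodes is the maximum quoted on the slide for these two node types.
dc2_max = cluster_totals("dc2.8xlarge", 128)
ra3_max = cluster_totals("ra3.16xlarge", 128)
print(dc2_max)
print(ra3_max)
```

The raw multiplication gives roughly 328 TB for DC2 and roughly 16 PB for RA3, close to the figures quoted on the slide.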

Integrate with Data Lake

• Customers are increasingly moving to data lake architectures
• Amazon Redshift lets you extend your data warehouse to your data lake: a Modern Data Architecture
• Flexibility to store highly structured, frequently accessed data in Redshift and keep infrequently used data in S3
• Query seamlessly across both to provide unique insights
Amazon Redshift is the only data warehouse that extends your queries to your Amazon S3 data lake without moving data
Redshift Modern Data Architecture
[Diagram: a data lake at the center, connected to non-relational databases, machine learning, data warehousing, log analytics, big data processing, and relational databases]
Customers moving to data lake architectures
Redshift enables you to have a modern data architecture

Amazon Redshift Federated Query
[Diagram: Amazon Redshift, reached over JDBC/ODBC, queries Amazon RDS (PostgreSQL, MySQL), Amazon Aurora (PostgreSQL, MySQL), and the Amazon S3 data lake]
• Query and join data from one or more Amazon RDS and Aurora PostgreSQL databases
• Analytics on operational data without data movement and ETL delays
• Use case: integrate operational data with DW and data lake for real-time analytics
• Intelligent distribution of computation to remote sources to optimize performance
• Flexible and easy way to ingest data, avoiding complex ETL pipelines
• Amazon RDS and Aurora MySQL support
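
As a rough illustration of the federated query flow, the sketch below generates the kind of SQL one might submit to Redshift: a `CREATE EXTERNAL SCHEMA ... FROM POSTGRES` statement followed by a join between a remote operational table and a local warehouse table. Every schema, table, endpoint, and ARN shown is a made-up placeholder, and the `federated_schema_sql` helper is hypothetical; the statements are only built as text here, not executed.

```python
def federated_schema_sql(schema, database, remote_schema, endpoint,
                         iam_role_arn, secret_arn):
    """Build a CREATE EXTERNAL SCHEMA statement that points at a PostgreSQL source."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM POSTGRES\n"
        f"DATABASE '{database}' SCHEMA '{remote_schema}'\n"
        f"URI '{endpoint}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )


# All values below are placeholders chosen for illustration.
schema_stmt = federated_schema_sql(
    schema="ops_pg",
    database="orders_db",
    remote_schema="public",
    endpoint="example-cluster.cluster-abc123.us-east-1.rds.amazonaws.com",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",
    secret_arn="arn:aws:secretsmanager:us-east-1:123456789012:secret:example",
)

# Join live operational rows with a local Redshift table, with no ETL step.
join_stmt = """
SELECT o.order_id, o.status, s.total_amount
FROM ops_pg.orders AS o            -- remote table in RDS/Aurora PostgreSQL
JOIN public.sales_history AS s     -- local Redshift table
  ON s.order_id = o.order_id
WHERE o.status = 'OPEN';
""".strip()

print(schema_stmt)
print(join_stmt)
```

The database credentials are referenced through a Secrets Manager ARN rather than embedded in the statement, which keeps them out of the SQL text.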

Redshift Spectrum Overview
Run SQL queries directly against data in S3 using thousands of nodes
Redshift Spectrum is a feature of Redshift that allows SQL queries on external data stored in Amazon S3
Benefits
• Modern Data Architecture lets you query exabytes of data in an S3 data lake
• Data is queried in place, with no loading of data
• Keeps your data warehouse lean by ingesting warm data locally while keeping other data in the data lake within reach
• Write query results from Redshift directly to S3 external tables
• Powered by a separate fleet of powerful Amazon Redshift Spectrum nodes
• Create materialized views on S3 data using Redshift Spectrum queries
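
To make the "query in place" idea concrete, here is a hedged sketch that builds the SQL typically involved: an external schema backed by the data catalog, an external table over Parquet files in S3, and a query that joins it with a local table. The helper names, table layout, bucket, and ARN are all assumptions made up for this example, and nothing is executed against a cluster.

```python
def spectrum_schema_sql(schema, catalog_database, iam_role_arn):
    """Build an external schema statement backed by the data catalog."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{catalog_database}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )


def spectrum_table_sql(schema, table, s3_location):
    """Build an external table definition over Parquet files (hypothetical columns)."""
    return (
        f"CREATE EXTERNAL TABLE {schema}.{table} (\n"
        f"    event_id    BIGINT,\n"
        f"    customer_id BIGINT,\n"
        f"    event_time  TIMESTAMP\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}';"
    )


# Placeholder names for illustration only.
schema_stmt = spectrum_schema_sql(
    "lake", "example_lake_db", "arn:aws:iam::123456789012:role/ExampleSpectrumRole"
)
table_stmt = spectrum_table_sql("lake", "click_events", "s3://example-bucket/clicks/")

# Warm data stays in a local table; colder data is read from S3 in the same query.
query_stmt = """
SELECT c.customer_name, COUNT(*) AS clicks
FROM lake.click_events AS e        -- external table, data stays in S3
JOIN public.customers AS c         -- local Redshift table
  ON c.customer_id = e.customer_id
GROUP BY c.customer_name;
""".strip()

for stmt in (schema_stmt, table_stmt, query_stmt):
    print(stmt, end="\n\n")
```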

Standards, formats, and open source
• Apache Flink
• Ganglia
• Apache HBase
• HCatalog
• Hadoop Distributed File System (HDFS)
• Apache Hive
• Hudi
• Java
• JupyterHub
• Apache Kafka
• Apache Livy
• Apache Mahout
• MapReduce
• Apache MXNet
• MySQL
• Apache Oozie
• Apache ORC
• Apache Parquet
• Phoenix
• Apache Pig
• Presto
• Python
• PyTorch
• R
• Scala
• Apache Spark
• Sqoop
• SQL
• TensorFlow
• Tez
• Yarn
• Apache Zeppelin
• Apache Zookeeper
…and many more

AWS alternatives to open source
• Amazon EMR (Hadoop, Spark): Spark, Hive, Presto, Flink, HBase
• Amazon ES (operational analytics): Elasticsearch, Logstash, Kibana
• Managed Streaming for Apache Kafka (real-time analytics): Kafka

Amazon EMR
Easily run Spark, Hadoop, Hive, Presto, HBase, and other big data frameworks
• Automate provisioning, configuring, and tuning: easy setup, management, and monitoring, with the latest open-source framework updates within 30 days
• Run workloads faster and more cost-effectively: 1.7x faster than standard Apache Spark 3.0 at 40% of the cost, and 2.6x faster than open-source Presto 0.238 at 80% of the cost
• Automatically scale up and down: manage cluster size based on utilization to reduce costs
• Simple and predictable pricing: per-second pricing, and save 50%–80% with Amazon EC2 Spot and Reserved Instances

Understanding Cluster and Nodes
• The central component of Amazon EMR is
the cluster. A cluster is a collection of
Amazon Elastic Compute Cloud (Amazon
EC2) instances.
• Each instance in the cluster is called a node.
• Each node has a role within the cluster,
referred to as the node type.
• Amazon EMR also installs different
software components on each node type,
giving each node a role in a distributed
application like Apache Hadoop.

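EMR Node Types
[Diagram: an EMR cluster made up of a master instance group, a core instance group (with HDFS), and a task instance group]
• Master instance group: the master node must keep running
• Core instance group: core nodes can be added and removed gracefully
• Task instance group: the cluster can tolerate the loss of task nodes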
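
The three node types map onto instance groups in a cluster request. The sketch below only assembles a request dictionary in the shape accepted by boto3's EMR `run_job_flow` call; the `build_cluster_request` helper, the cluster name, instance types, counts, and release label are illustrative assumptions, and no AWS call is made.

```python
def build_cluster_request(name, core_count, task_count):
    """Assemble keyword arguments shaped like boto3's EMR run_job_flow parameters."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.2.0",  # assumed release; pick the one you need
        "Applications": [{"Name": "Spark"}, {"Name": "Hive"}],
        "Instances": {
            "InstanceGroups": [
                # Exactly one master node: it must keep running.
                {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                # Core nodes hold HDFS data, so On-Demand is the cautious choice.
                {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
                 "InstanceType": "m5.xlarge", "InstanceCount": core_count},
                # Task nodes hold no HDFS data, so losing Spot capacity is tolerable.
                {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
                 "InstanceType": "m5.xlarge", "InstanceCount": task_count},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,  # persistent cluster
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request("example-analytics", core_count=2, task_count=4)
for group in request["Instances"]["InstanceGroups"]:
    print(group["InstanceRole"], group["Market"], group["InstanceCount"])

# With credentials configured, the request could be submitted with:
#   boto3.client("emr").run_job_flow(**request)
```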

Instance fleets for advanced Spot provisioning
Master node, core instance fleet, task instance fleet
• Provision from a list of instance types with Spot and On-Demand
• Launch in the most optimal Availability Zone based on capacity/price

Cluster Types
• Persistent cluster: interactive queries (Spark SQL | Presto)
• Transient cluster: batch jobs (X hours nightly), add/remove nodes
Amazon S3

Managed Scaling
• EMR Managed Scaling Capacity
• Control of Upper and Lower limits
• Control of Split between On-Demand and Spot instances
• Control Split between Core and Task Nodes
• Amazon EMR versions 5.30.0 and later (except for Amazon EMR 6.0.0)
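
The controls listed above correspond to fields of a managed scaling policy. The following sketch builds a policy dictionary in the shape used by boto3's EMR `put_managed_scaling_policy` call; the `managed_scaling_policy` helper, its validation rules, and the numbers are assumptions for illustration, and no AWS call is made.

```python
def managed_scaling_policy(min_units, max_units, max_on_demand_units, max_core_units):
    """Build a ManagedScalingPolicy dict after basic sanity checks."""
    if not 1 <= min_units <= max_units:
        raise ValueError("need 1 <= min_units <= max_units")
    if max_on_demand_units > max_units or max_core_units > max_units:
        raise ValueError("On-Demand and core limits cannot exceed max_units")
    return {
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": min_units,                   # lower limit
            "MaximumCapacityUnits": max_units,                   # upper limit
            "MaximumOnDemandCapacityUnits": max_on_demand_units,  # rest can be Spot
            "MaximumCoreCapacityUnits": max_core_units,          # rest are task nodes
        }
    }


policy = managed_scaling_policy(
    min_units=2, max_units=20, max_on_demand_units=5, max_core_units=4
)
print(policy)

# With credentials configured, the policy could be attached with:
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="j-EXAMPLE", ManagedScalingPolicy=policy)
```

In this example the cluster may grow to 20 instances, of which at most 5 are On-Demand and at most 4 are core nodes, so most of the extra capacity would come from Spot task nodes.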

Data Lake query services: How to choose?
Amazon Redshift
• Data warehouse, highly relational, complex joins
• Modern data architecture approach
• Sub-second latency
• Joins between data warehouse data & an S3 data lake
Amazon Athena
• Interactive ad-hoc queries
• Serverless
• No data warehouse, not 24x7
• Log analysis
• Offload S3 workload from the data warehouse
Amazon EMR
• Process large volumes of data
• Use big data tools like Apache Hadoop, Spark, Presto, Hive
• Run Jupyter-based EMR notebooks
[Diagram: a data lake at the center, connected to non-relational databases, machine learning, data warehousing, log analytics, big data processing, and relational databases]

AWS Lambda

Serverless Architecture
Event source → Function → Services / other
• Event sources: changes in data state, requests to endpoints, changes in resource state
• Function runtimes: Node.js, Python, Java, C#, Go, Ruby, bring your own

Anatomy of a Lambda Function
Handler function
• Function executed on invocation
• Processes incoming event
Event
• Invocation data sent to function
• Shape differs by event source
Context
• Additional information from Lambda service
• Examples: request ID, time remaining
app.py
def handler(event, context):
    # Build a greeting from the "name" field of the incoming event
    msg = 'Hello {}'.format(event['name'])
    return {'message': msg}
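
A handler like this can be exercised locally before it is deployed. The sketch below repeats the handler and calls it with a sample event and a stand-in context object; the `fake_context` object and the sample values are assumptions for local testing, with attribute names mirroring the request ID and time-remaining examples mentioned above.

```python
from types import SimpleNamespace


def handler(event, context):
    # Build a greeting from the "name" field of the incoming event
    msg = 'Hello {}'.format(event['name'])
    return {'message': msg}


# Stand-in for the context object that the Lambda service passes at invocation.
fake_context = SimpleNamespace(
    aws_request_id="local-test-request",
    get_remaining_time_in_millis=lambda: 3000,
)

# The event shape depends on the event source; here it is a plain dictionary.
result = handler({"name": "DevAx"}, fake_context)
print(result)
print(fake_context.aws_request_id, fake_context.get_remaining_time_in_millis())
```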

Lambda Function Configuration
Power Rating
• Select between 128MB and 10GB
• CPU and network allocated
proportionally
• Power tune to balance cost and
speed
Permissions Model
• Execution Role grants function
access to resources via IAM
• Function Policy controls
invocation
128MB 10GB
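
Power tuning comes down to comparing memory size multiplied by duration across settings. A minimal sketch of that comparison follows; the price per GB-second is an assumed example rate, the measured durations are invented for illustration, and per-request charges are ignored.

```python
# Assumed example rate in USD per GB-second; check current pricing for your region.
PRICE_PER_GB_SECOND = 0.0000166667


def invocation_cost(memory_mb, duration_ms, price_per_gb_second=PRICE_PER_GB_SECOND):
    """Estimate the compute cost of one invocation (request charge not included)."""
    if not 128 <= memory_mb <= 10240:
        raise ValueError("memory must be between 128 MB and 10 GB")
    return (memory_mb / 1024) * (duration_ms / 1000) * price_per_gb_second


# Hypothetical measurements: average duration (ms) observed at each memory size.
measurements = {128: 4200, 512: 1000, 1024: 520, 2048: 300}

for memory_mb, duration_ms in measurements.items():
    cost = invocation_cost(memory_mb, duration_ms)
    print(f"{memory_mb:>5} MB  {duration_ms:>5} ms  ${cost:.10f} per invocation")

cheapest = min(measurements, key=lambda m: invocation_cost(m, measurements[m]))
fastest = min(measurements, key=measurements.get)
print("cheapest:", cheapest, "MB; fastest:", fastest, "MB")
```

Because CPU is allocated in proportion to memory, a larger setting can finish sooner and end up costing about the same or even less, which is the trade-off that power tuning explores.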

Lambda ideal usage pattern for data analytics
• Real-time file processing
• Real-time stream processing
• Extract, transform, load (ETL)
• Replace Cron
• Process AWS Events
• Datalake Data Serving
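
As an example of the real-time file processing pattern, the sketch below shows a handler that reacts to an S3 event notification and extracts the bucket and object key of each new file. The bucket name, key, and the `extract_s3_objects` helper are hypothetical, and the actual read/transform step is left as a comment.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # skip records that did not come from S3
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])  # keys arrive URL-encoded
        objects.append((bucket, key))
    return objects


def handler(event, context):
    objects = extract_s3_objects(event)
    # A real function would read and transform each object here,
    # then write the result to a curated area of the data lake.
    return {"processed": [f"s3://{bucket}/{key}" for bucket, key in objects]}


# Trimmed-down sample event with placeholder names.
sample_event = {
    "Records": [
        {
            "eventSource": "aws:s3",
            "s3": {
                "bucket": {"name": "example-datalake-raw"},
                "object": {"key": "logs/2021/app+log+1.json"},
            },
        }
    ]
}

result = handler(sample_event, None)
print(result)
```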

Analytics on AWS

Thank you!
