Serverless Datalake Day with AWS

© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Serverless Data Lake Day with AWS
Tame Your Big Data with Kinesis Firehose, S3, Glue, Athena, QuickSight
Kim Kao, Solutions Architect

Big Data still challenges most enterprises
Analytics pipeline: Ingest → Store → Analyze (customer & operational data in, customer & operational insights out)
• Rigid data ingest: inability to process new types of data
• Limited analytics platforms: inability to generate new types of insights
• Data doesn’t “connect”: siloed data prevents a single source of truth

Data Lake concepts

The concept of a Data Lake
• All data in one place, a single source of truth
• Handles structured/semi-structured/unstructured/raw data
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
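A minimal Python sketch of "schema on read", for illustration only. The sample records, the field names (`user_id`, `amount`, `coupon`), and the `read_with_schema` helper are hypothetical and not part of the workshop. The point is that raw data is stored untouched and each consumer applies its own schema when it reads.

```python
import json

# Raw events land in the lake exactly as produced; nothing is validated on write.
# (Hypothetical sample data, for illustration only.)
RAW_LINES = [
    '{"user_id": "42", "amount": "19.90", "note": "first order"}',
    '{"user_id": "7", "amount": "5"}',
    '{"user_id": "13", "coupon": "SPRING"}',
]

def read_with_schema(lines, schema):
    """Apply a schema (field name -> type converter) while reading.

    Fields missing from a record become None; fields not named in the
    schema are ignored rather than rejected.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }

# Two consumers read the same raw data through different schemas.
billing_view = list(read_with_schema(RAW_LINES, {"user_id": int, "amount": float}))
marketing_view = list(read_with_schema(RAW_LINES, {"user_id": int, "coupon": str}))
```

Neither view required the raw records to be rewritten, which is why a new consumer can be added later without re-ingesting data.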

A data lake is a centralized repository that allows
you to store all your structured and unstructured
data at any scale

Typical steps of building a data lake
1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics and consumption

Workshop Solution Architecture
Architecture Highlights
• Serverless
• Hybrid
Note: Draw.io source for the diagram will be shared at the Immersion Day.

Service Used in this Lab
S3: Store Data at Any Scale with Unmatched Durability & Availability
• Built to store any amount of data
• Runs on the world’s largest global cloud infrastructure
• Designed to deliver 99.999999999% durability
• Geographic redundancy & automatic replication
• Seamlessly replicates data between any regions
• Tiered storage to optimize price/performance: store data at $0.023/GB/month in S3 ($0.004/GB/month in Glacier)
Storage tiers: S3 Standard, S3 Standard-Infrequent Access, S3 One Zone-IA, Glacier
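A rough illustration of the tiering arithmetic, using only the two per-GB prices quoted on this slide. The 10 TB volume is a made-up example, and the sketch assumes storage charges only; request, retrieval, and transfer fees are ignored.

```python
# Per-GB monthly prices quoted on the slide (USD).
# Actual prices vary by region and change over time.
PRICE_PER_GB_MONTH = {
    "S3": 0.023,
    "Glacier": 0.004,
}

def monthly_storage_cost(size_gb: float, tier: str) -> float:
    """Storage-only cost for one month; ignores request and retrieval fees."""
    return size_gb * PRICE_PER_GB_MONTH[tier]

size_gb = 10 * 1024  # hypothetical 10 TB data lake
s3_cost = monthly_storage_cost(size_gb, "S3")
glacier_cost = monthly_storage_cost(size_gb, "Glacier")

print(f"S3:      ${s3_cost:,.2f}/month")
print(f"Glacier: ${glacier_cost:,.2f}/month")
```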

Process Data in Place…
Data stays in Amazon S3 and is processed in place by:
• Amazon Athena
• Amazon Redshift Spectrum
• Amazon SageMaker
• AWS Glue

Amazon Kinesis: Real Time
• Kinesis Data Firehose: Load data streams into AWS data stores
• Kinesis Data Streams: Build custom applications that analyze data streams
• Kinesis Video Streams: Capture, process, and store video streams for analytics
• Kinesis Data Analytics: Analyze data streams with SQL

Service Used in this Lab
Amazon Kinesis Firehose
• Zero administration: Capture and deliver streaming data to Amazon S3, Amazon Redshift, or Amazon Elasticsearch Service without writing an app or managing infrastructure.
• Direct-to-data-store integration: Batch, compress, and encrypt streaming data for delivery in as little as 60 seconds.
• Seamless elasticity: Seamlessly scales to match data throughput without intervention.
How data flows:
1. Capture and submit streaming data to Firehose
2. Firehose loads streaming data continuously into S3, Amazon Redshift, and Amazon ES
3. Analyze streaming data using your favorite BI tools
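A minimal sketch of the "capture and submit" step in Python. It assumes a client that behaves like `boto3.client("firehose")`, whose `put_record` call takes `DeliveryStreamName` and `Record={"Data": ...}`. The `send_event` helper, the `FakeFirehose` stand-in, the stream name `workshop-stream`, and the sample event are all hypothetical; the fake client is there only so the sketch runs without AWS credentials.

```python
import json

def send_event(firehose_client, stream_name, event):
    """Send one JSON event to a Firehose delivery stream.

    `firehose_client` is expected to behave like boto3.client("firehose").
    A trailing newline is appended so records stay line-delimited once
    Firehose concatenates them into a single delivered object.
    """
    payload = (json.dumps(event) + "\n").encode("utf-8")
    response = firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )
    return response["RecordId"]

class FakeFirehose:
    """Stand-in client (not an AWS class) that records what was sent."""

    def __init__(self):
        self.sent = []

    def put_record(self, DeliveryStreamName, Record):
        self.sent.append((DeliveryStreamName, Record["Data"]))
        return {"RecordId": f"fake-{len(self.sent)}"}

client = FakeFirehose()
record_id = send_event(client, "workshop-stream", {"sensor": "a1", "temp": 21.5})
```

Against a real account, pass `boto3.client("firehose")` and the name of an existing delivery stream instead of the fake client. Note that no partition key or shard is involved in the call.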

Kinesis Data Firehose: How it Works
Ingest → Transform → Deliver
• Sources (ingest): AWS IoT, Amazon Kinesis Agent, Amazon Kinesis Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, Apache Kafka
• Destinations (deliver): Amazon S3, Amazon Redshift, Amazon Elasticsearch Service
Amazon Kinesis Streams vs Firehose
Use case
• Kinesis Streams: Capture and expose data streams to build arbitrary stream processing applications
• Kinesis Firehose: Fully managed service for automatic loading of data streams into S3, Redshift, Elasticsearch, etc.
Provisioning and Resource Management
• Kinesis Streams: Provisioned model via Streams; customer-controllable construct of “Shards”
• Kinesis Firehose: No provisioning of the underlying resource called a DeliveryStream; no Shards (not a customer-visible construct)
Scaling
• Kinesis Streams: Customer owns explicit scaling of stream/shards via Split/Merge APIs
• Kinesis Firehose: Completely elastic; no explicit scaling actions needed
Ingestion (Put Data)
• Kinesis Streams: Customer owns “partition key” as part of Put* API call
• Kinesis Firehose: Different and simplified Put* API that doesn’t need a partition key
Retrieval (Get Data)
• Kinesis Streams: Get API to retrieve data directly from stream. Full flexibility to write/manage own application with Kinesis Client Library and connector libraries
• Kinesis Firehose: No Get* API; data appears in destination (like S3/Redshift) after meeting configuration criteria. No application to be written or managed; customers specify a simple set of pre-defined configurations on the console or API
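To show what owning the partition key means on the Streams side, here is a hedged Python sketch. Kinesis Streams maps each partition key to a 128-bit value with MD5 and routes the record to the shard whose hash key range contains that value. The sketch assumes the hash space is split into equal ranges, which may not match a stream that has been split or merged; `shard_for_key` and the device names are hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value

def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumes the hash space is divided into `shard_count` equal ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE

for key in ["device-1", "device-2", "device-3"]:
    print(key, "-> shard", shard_for_key(key, shard_count=4))
```

Records with the same partition key always land on the same shard, so a poorly chosen key can overload one shard. Firehose hides this concern entirely, since its Put* API takes no partition key.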

Firehose Highlights
• Records: The maximum size of a record (before Base64-encoding)
is 1000 KB.
• Compression: Amazon Kinesis Data Firehose allows you to
compress your data before delivering it to Amazon S3.
• Transformation: Firehose can invoke an AWS Lambda function to
transform incoming data before delivering it to destinations.
• Buffering:
  • Buffer size is in MB and ranges from 1 MB to 128 MB.
  • Buffer interval is in seconds and ranges from 60 seconds to 900 seconds.
Source: Firehose FAQ
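The Transformation bullet can be sketched as a Lambda handler in Python. Firehose passes the function a batch under `records`, each with a `recordId` and Base64-encoded `data`, and expects every `recordId` back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The `message` field and the uppercase transform are hypothetical, chosen only to show the request and response shape.

```python
import base64
import json

def lambda_handler(event, context):
    """Uppercase a hypothetical `message` field; drop records that lack it."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            # Not valid Base64/JSON: report failure and return the data unchanged.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if not isinstance(payload, dict) or "message" not in payload:
            # Nothing to transform: tell Firehose to skip this record.
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["message"] = payload["message"].upper()
        encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": encoded.decode("ascii")})
    return {"records": output}

# Local check with a hand-built event; no AWS resources are involved.
sample = {"records": [
    {"recordId": "1", "data": base64.b64encode(b'{"message": "hello"}').decode("ascii")},
    {"recordId": "2", "data": base64.b64encode(b'{"other": 1}').decode("ascii")},
    {"recordId": "3", "data": base64.b64encode(b"not json").decode("ascii")},
]}
result = lambda_handler(sample, None)
```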
