Modern Data Platform on AWS
- 1. SUMMIT
Amsterdam
- 2. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Modern Data Platform on AWS
Damon Cortesi
Big Data Architect - AWS
@dacort
ANT001
David Morel
Takeaway.com
- 3.
A brief history of significant Big Data releases
2004: Google publishes MapReduce paper
2006: Hadoop is created; HBase development starts
2008: Facebook launches Hive
2009: AWS EMR announced; Apache Spark released
2012: Facebook launches Presto
2015: MXNet paper published
2016: Amazon Athena & AWS Glue announced
- 4.
There is more data than people think
• Data grows >10x every 5 years
• Data platforms need to live for 15 years
• Data platforms need to scale 1,000x
- 5.
There are more people accessing data
• Data Scientists
• Analysts
• Business Users
• Applications
And more requirements for making data available
• Secure
• Real time
• Flexible
• Scalable
- 6.
AWS databases and analytics
Broad and deep portfolio, built for builders
Analytics
• Amazon Redshift: data warehousing
• Amazon EMR: Hadoop + Spark
• Athena: interactive analytics
• Kinesis Analytics: real-time
• Amazon Elasticsearch Service: operational analytics
Databases
• RDS: MySQL, PostgreSQL, MariaDB, Oracle, SQL Server
• Aurora: MySQL, PostgreSQL
• DynamoDB: key value, document
• ElastiCache: Redis, Memcached
• Neptune: graph
• Timestream: time series
• QLDB: ledger database
• RDS on VMware
Business Intelligence & Machine Learning
• Amazon QuickSight
• Amazon SageMaker
• Amazon Comprehend
• Amazon Rekognition
• Amazon Lex
• Amazon Transcribe
• AWS DeepLens
Blockchain
• Managed Blockchain
• Blockchain Templates
Data Lake
• S3/Amazon Glacier
• AWS Glue: ETL & Data Catalog
• Lake Formation: data lakes
Data Movement
• Database Migration Service | Snowball | Snowmobile | Kinesis Data Firehose | Kinesis Data Streams | Data Pipeline | Direct Connect
AWS Marketplace
• 730+ Database solutions
• 600+ Analytics solutions
• 250+ solutions
• 25+ Blockchain solutions
• 20+ Data lake solutions
• 30+ solutions
- 7.
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
- 8.
Data lake with AWS Glue
Diagram components: Amazon S3 (Raw data), Amazon S3 (Staging data), Amazon S3 (Processed data), AWS Glue Data Catalog, Crawlers (shown three times in the diagram)
- 10.
Amazon S3: Object Storage
Security and Compliance
Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
Flexible Management
Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
Query in Place
Run analytics & ML on the data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
- 11.
Data Movement From Real-time Sources
Amazon Kinesis Video Streams
Securely stream video from connected devices to AWS for analytics, machine learning (ML), and other processing.
Amazon Kinesis Data Firehose
Capture, transform, and load data streams into AWS data stores for near real-time analytics with existing business intelligence tools.
Amazon Kinesis Data Streams
Build custom, real-time applications that process data streams using popular stream processing frameworks.
AWS IoT Core
Supports billions of devices and trillions of messages, and can process and route those messages to AWS endpoints and to other devices reliably and securely.
Managed Streaming for Kafka
Fully managed open-source platform for building real-time streaming data pipelines and applications.
- 12.
Amazon Kinesis Data Streams
- 13.
Amazon Kinesis Data Firehose
- 14.
Kinesis events to S3
Diagram components: Clients, Kinesis Data Streams, Kinesis Data Firehose, Lambda Transformation, Save as Parquet, Aggregated JSON Data, Source backup, Aggregated Parquet Data, Crawlers, Amazon Athena
Prefix: raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/
Buffer: Up to 128 MB or 15 minutes
New! as of 12th Feb
• Support for custom S3 prefix
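To make the custom prefix on this slide concrete, here is a minimal sketch of how a prefix with `!{timestamp:...}` expressions might expand into a Hive-style partition path. It is a simplified emulation in plain Python, not Firehose's implementation: the function name `expand_prefix`, the sample arrival time, and the small pattern table are assumptions for illustration, and only the `yyyy`, `MM`, `dd` and `HH` patterns are handled.

```python
import re
from datetime import datetime, timezone

# Assumption: a tiny subset of Java-style date patterns mapped to strftime codes.
# The real service accepts more patterns than these four.
_PATTERNS = {"yyyy": "%Y", "MM": "%m", "dd": "%d", "HH": "%H"}


def expand_prefix(prefix: str, arrival: datetime) -> str:
    """Replace each !{timestamp:<pattern>} expression with the formatted arrival time."""

    def _format(match: "re.Match") -> str:
        pattern = match.group(1)
        # Raises KeyError for patterns outside the small table above.
        return arrival.strftime(_PATTERNS[pattern])

    return re.sub(r"!\{timestamp:([^}]+)\}", _format, prefix)


prefix = "raw/life/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/"
# Hypothetical arrival time, in UTC.
arrival = datetime(2019, 4, 10, 9, 30, tzinfo=timezone.utc)
example_key_prefix = expand_prefix(prefix, arrival)
print(example_key_prefix)  # raw/life/year=2019/month=04/day=10/
```

Writing objects under `year=/month=/day=` paths is what lets a crawler register them as partitions, so that queries filtering on those columns can skip unrelated objects.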
- 15.
Data Movement From On-premises Datacenters
AWS Snowball, Snowball Edge and Snowmobile
Petabyte- and exabyte-scale data transport solutions that use secure appliances to transfer large amounts of data into and out of the AWS cloud.
AWS Direct Connect
Establish a dedicated network connection from your premises to AWS; reduce your network costs, increase bandwidth throughput, and get a more consistent network experience than Internet-based connections.
AWS Storage Gateway
Lets your on-premises applications use AWS for storage; includes a highly optimized data transfer mechanism and bandwidth management, along with a local cache.
AWS Database Migration Service
Migrate databases from the most widely used commercial and open-source offerings to AWS quickly and securely, with minimal downtime to applications.
- 16.
AWS Database Migration Service
- 17.
DMS to S3
Diagram components: Source database, AWS Database Migration Service, Snapshot Data, Crawlers, Data catalog, AWS Glue, Amazon Athena, Amazon EMR, Amazon Redshift
New! as of 25th March
• Support for Parquet
• Support for S3 encryption with KMS
- 18.
DMS to S3 Change Data Capture (CDC)
• Challenging to do easily
• Need to maintain a staging table and reconstitute dataset
from pyspark.sql import Window
from pyspark.sql import functions as F

# df2 holds the DMS change records: "cdc" is the operation flag (I/U/D)
# and "idx" orders the changes for a given "id".
newDf = df2.filter("cdc = 'I'")
updDf = df2.filter("cdc = 'U'")
delDf = df2.filter("cdc = 'D'")
# Keep only the most recent update per id.
w = Window.partitionBy("id").orderBy(F.col("idx").desc())
latestUpdateDf = updDf.withColumn("rn", F.row_number().over(w)).where(F.col("rn") == 1).drop("rn")
# Left-join the inserted rows to the updated ids, keep only the rows that
# have no update (id_1 is null), then union in the latest updates.
# Assumption: the slide's undefined insertsDf refers to the inserted rows, newDf.
tempDf = latestUpdateDf.select("id").withColumnRenamed("id", "id_1")
filteredBaseDf = newDf.join(tempDf, newDf.id == tempDf.id_1, 'left')
filteredBaseDf = filteredBaseDf.filter("id_1 is null").drop("id_1")
insertAndUpdateDf = filteredBaseDf.union(latestUpdateDf)
# Ok, now remove any deleted rows: keep only rows with no matching delete.
tempDf = delDf.select("id").distinct().withColumnRenamed("id", "id_del")
finalDf = insertAndUpdateDf.join(tempDf, insertAndUpdateDf.id == tempDf.id_del, 'left')
finalDf = finalDf.filter("id_del is null").drop("id_del")
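The reconciliation steps on this slide can be traced on a toy dataset. The sketch below uses plain Python dictionaries rather than Spark so it runs anywhere; the column names `id`, `idx` and `cdc` follow the slide, while the function name `apply_cdc` and the sample rows are invented for illustration. Like the slide, it assumes each id is inserted once and that a delete is final.

```python
from typing import Dict, List


def apply_cdc(records: List[Dict]) -> List[Dict]:
    """Collapse insert/update/delete change records into the current rows."""
    # Inserted rows, keyed by id.
    inserts = {r["id"]: r for r in records if r["cdc"] == "I"}

    # Latest update per id: the update with the highest idx wins.
    latest_updates: Dict[int, Dict] = {}
    for r in records:
        if r["cdc"] == "U":
            current = latest_updates.get(r["id"])
            if current is None or r["idx"] > current["idx"]:
                latest_updates[r["id"]] = r

    # Ids that were deleted at some point.
    deleted = {r["id"] for r in records if r["cdc"] == "D"}

    # Updates replace inserts with the same id; deleted ids are dropped.
    merged = {**inserts, **latest_updates}
    return sorted(
        (row for key, row in merged.items() if key not in deleted),
        key=lambda row: row["id"],
    )


# Hypothetical change records.
changes = [
    {"id": 1, "idx": 1, "cdc": "I", "name": "pizza"},
    {"id": 2, "idx": 2, "cdc": "I", "name": "sushi"},
    {"id": 3, "idx": 3, "cdc": "I", "name": "burger"},
    {"id": 1, "idx": 4, "cdc": "U", "name": "pizza margherita"},
    {"id": 1, "idx": 5, "cdc": "U", "name": "pizza funghi"},
    {"id": 2, "idx": 6, "cdc": "D", "name": "sushi"},
]
current_rows = apply_cdc(changes)
for row in current_rows:
    print(row["id"], row["name"])
```

Row 1 ends up with its latest update, row 2 disappears because it was deleted, and row 3 is carried over unchanged from its insert.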
- 19.
AWS Glue ETL
New!
- 20.
Third-party API to S3
Diagram components: 3rd Party API, AWS Glue Python Shell, Incremental Exports, Crawlers, Data catalog, Glue ETL, Transformed Data, Amazon Athena, Amazon Redshift
- 21.
Parquet File Format
• Row group metadata allows the Parquet reader to skip portions of files, or entire files.
• Columnar format is optimized for analytics.
• Column metadata allows for pre-aggregation.
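As a rough illustration of why row group metadata lets a reader skip data, the sketch below models row groups that carry min/max statistics for one column and a scan that consults those statistics before touching any rows. It is a simplified model, not the API of any Parquet library; `RowGroup`, `make_group` and `scan_greater_equal` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RowGroup:
    """A chunk of rows for one column, plus the statistics stored in its metadata."""
    min_value: int
    max_value: int
    rows: List[int]


def make_group(rows: List[int]) -> RowGroup:
    # Statistics are computed once, when the group is written.
    return RowGroup(min(rows), max(rows), rows)


def scan_greater_equal(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Return the values >= threshold and the number of row groups actually read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        # The metadata alone proves that nothing in this group can match.
        if group.max_value < threshold:
            continue
        groups_read += 1
        matches.extend(value for value in group.rows if value >= threshold)
    return matches, groups_read


groups = [make_group([1, 4, 7]), make_group([10, 12, 19]), make_group([20, 25, 31])]
matches, groups_read = scan_greater_equal(groups, 15)
print(matches, groups_read)  # [19, 20, 25, 31] 2
```

The first group is never read because its maximum is below the threshold. The same statistics are what make simple aggregates such as a column minimum or maximum answerable from metadata alone.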
- 22.
Parquet
• Previously it was common to deliver in JSON/CSV/text then run
another process to convert to Parquet. It’s becoming more common to
deliver straight to Parquet.
• Kinesis Firehose – Added support May 2018
  • Custom prefix support: Feb 2019
  • Requires schema in Glue Data Catalog
• Athena – CREATE TABLE AS SELECT: Oct 2018
• EMR – S3-optimized Parquet committer: Nov 2018
• Database Migration Service – Added Parquet support: Mar 2019
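For the Athena route, a CREATE TABLE AS SELECT statement that writes Parquet might look like the output of the sketch below. The helper `build_ctas`, the table names and the bucket path are hypothetical; the `format`, `external_location` and `partitioned_by` table properties are assumed from Athena's CTAS syntax. The helper only assembles a string and does no quoting or validation, so it is not meant for untrusted input.

```python
from typing import List


def build_ctas(target_table: str, source_table: str,
               external_location: str, partition_columns: List[str]) -> str:
    """Assemble an Athena CTAS statement that stores its result as Parquet."""
    partitions = ", ".join(f"'{column}'" for column in partition_columns)
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{external_location}',\n"
        f"  partitioned_by = ARRAY[{partitions}]\n"
        f")\n"
        f"AS SELECT * FROM {source_table}"
    )


ctas_sql = build_ctas(
    target_table="events_parquet",                                # hypothetical
    source_table="events_json",                                   # hypothetical
    external_location="s3://example-bucket/processed/events/",    # hypothetical
    partition_columns=["year", "month"],
)
print(ctas_sql)
```

One caveat with `SELECT *`: the partition columns need to come last in the select list, so a real query may have to name its columns explicitly.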
- 24.
AWS Glue ETL
New!
- 25.
Amazon EMR
- 26.
Amazon Redshift
- 27.
Amazon Athena
Diagram components: Data Lake, AWS Glue Data Catalog, Permissions, Reporting & Analytics, Machine Learning, Custom Applications, AWS Cloud (shown three times in the diagram)
- 28.
Amazon EMR Notebooks in the Console
A managed analytics environment based on Jupyter Notebooks
• Auto saves notebook file to your S3 bucket
• Run queries on your remote EMR cluster
Diagram components: users, AWS Management Console for EMR, EMR-managed notebook based on Jupyter notebook, Amazon EMR clusters, EMR VPC, Customer VPC
- 29.
Amazon QuickSight
- 31.
Data lake with AWS Glue
Diagram components: Amazon S3 (Raw data), Amazon S3 (Staging data), Amazon S3 (Processed data), AWS Glue Data Catalog, Crawlers (shown three times in the diagram)
- 32.
AWS Lake Formation
Build a secure data lake in days
• Identify, ingest, clean, and transform data
• Enforce security policies across multiple services
• Gain and manage new insights
- 33.
How it works
- 34.
Easily load data to your data lake
Diagram components: logs, DBs, Blueprints (one-shot, incremental), Data import, Crawlers, ML-based data prep, Lake Formation, Data Catalog, Access Control, Data Lake Storage
- 35.
Blueprints build on AWS Glue
- 36.
Easily de-duplicate your data with ML transforms
- 37.
Secure once, access in multiple ways
Diagram components: Admin, Lake Formation, Data Catalog, Access Control, Data Lake Storage
- 38.
Security permissions in Lake Formation
• Control data access with simple grant and revoke permissions
• Specify permissions on tables and columns rather than on buckets and objects
• Easily view policies granted to a particular user
• Audit all data access in one place
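A column-level grant of the kind described on this slide could be expressed as a request like the one built below. This is a sketch under stated assumptions: the field names follow the request shape of the Lake Formation GrantPermissions API as exposed by boto3, while `build_column_grant`, the role ARN, and the database, table and column names are hypothetical. The code only builds the request dictionary and does not call AWS.

```python
from typing import Dict, List


def build_column_grant(principal_arn: str, database: str,
                       table: str, columns: List[str]) -> Dict:
    """Build a request granting SELECT on specific columns of one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],
    }


grant_request = build_column_grant(
    principal_arn="arn:aws:iam::123456789012:role/analyst",  # hypothetical role
    database="sales",                                        # hypothetical database
    table="orders",                                          # hypothetical table
    columns=["order_id", "order_date"],                      # hypothetical columns
)
print(grant_request)
```

With boto3 installed and credentials configured, a request of this shape would typically be sent with `boto3.client("lakeformation").grant_permissions(**grant_request)`, and withdrawn again with `revoke_permissions` using the same arguments.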
- 39.
AWS Lake Formation Pricing
No additional charges: only pay for the underlying services used.
- 41. A tale of AWS at Takeaway.com
Data Engineering in the Business Intelligence team
- 52.
Thank you!