Building Data Lakes and Analytics on AWS

Build Data Lakes and Analytics on AWS
@2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

VisualizationVariability
Data Is Defined Many Different Ways
Volume Velocity Variety Veracity Value

Data Is Changing à Analytics Are Adopting
Capture and store
new data at PB-EB
scale
Do new type of analytics
in a cost effective way
• Machine learning
• Big data processing
• Real-time analytics
• Full-text search
New types of
analytics

Public Sector entities that successfully generate value from their data will be
able to offer better citizen services and data driven decisions
Most Important: Driving Value from Data
What are those use cases?
Analytics
Smart Cities
AI/ML Data lake

*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
What is Data Lake?

Traditionally, Analytics Used to Look Like This
OLTP ERP CRM LOB
Data warehouse
Business intelligence • Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc
• Large initial CAPEX + $10K–$50K/TB/year

Data Lakes Extend the Traditional Approach
Data warehouse
Business intelligence
OLTP ERP CRM LOB
• Relational and nonrelational data
• TBs–EBs scale
• Diverse analytical engines
• Low-cost storage & analytics
Devices Web Sensors Social
Data lake
Big data processing,
real-time, machine learning

A Data Lake is not an Enterprise Data Warehouse
Complementary to EDW (not replacement) EDW can be sourced from Data Lake
Schema on read (no predefined schemas) Schema on write (predefined schemas)
Structured/semi-structured/Unstructured data Structured data only
Fast ingestion of new data/content Time consuming to introduce new content
Data Science + Prediction/Advanced Analytics + BI use
cases
BI use cases
Data at low level of detail/granularity Data at summary/aggregated level of detail
Loosely defined SLAs Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced
analytics)
Limited flexibility in tools (SQL only)
Elastic storage and compute capacity – decoupled
Explicitly sized environments, compute and storage
scaled in linearly
Data Lake EDW

Data Lakes from AWS
Analytics
• Unmatched durability, and availability at EB scale
• Best security, compliance, and audit capabilities
• Object-level controls for fine-grain access
• Fastest performance by retrieving subsets of data
• The most ways to bring data in
• 2x as many integrations with partners
• Analyze with broadest set of analytics & ML services
Machine
learning
Real-time dataOn-premises
Data Lake
on AWS
movementdata movement

Managed ML Service
Deep Learning AMIs
Video and Image Recognition
Conversational Interfaces
Deep-Learning Video Camera
Natural Language Processing
Language Translation
Speech Recognition
Text-to-Speech
Interactive Analysis
Hadoop & Spark
Data Warehousing
Full-text search
Real-time analytics
Dashboards & Visualizations
Dedicated Network connection
Secure appliances
Ruggedized Shipping Container
Database migration
Connect Devices to AWS
Real-time Data Streams
Real-time Video Streams
Data Lake
on AWS
Storage | Archival Storage | Data Catalog
AnalyticsMachine learning
Real-time dataOn-premises movementdata movement
Data Lakes, Analytics, and IoT Portfolio from AWS
Broadest, deepest set of analytic services

Data Lakes, Analytics, and IoT Portfolio from AWS
Broadest, deepest set of analytic services
Amazon SageMaker
AWS Deep Learning AMIs
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Elasticsearch Service
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Database Migration Service
AWS IoT Core
Amazon Kinesis Data Firehose
Amazon Kinesis Data Streams
Amazon Kinesis Video Streams
Data Lake
on AWS

How do I ingest my data?

How do I drive value?
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data Lake on AWS
Real-time data movementTraditional data movement

Ingest data based on the type of data
Open and comprehensive
• Data movement from on-premises
datacenters
• Dedicated network connection
• Secure appliances
• Ruggedized shipping container
• Database migration
• Gateway that lets applications write to the cloud
• Data movement from real-time sources
• Connect devices to AWS
• Real-time data streams
• Real-time video streams
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS Storage Gateway
AWS IoT Core
Data movement from
real-time sources
Data movement from
your datacenters
Amazon S 3
Amazon Gl ac ier
AWS Gl u e

Real-time data movement and data lakes on
AWS
Amazon
Kinesis Data
Firehose
AWS Glue
Data Catalog
Amazon
S3 Data
Data Lake
on AWS
Amazon
Kinesis Data
Streams
Data definitionKinesis Agent
Apache Kafka
AWS SDK
LOG4J
Flume
Fluentd
AWS Mobile SDK
Kinesis Producer Library

IMPORTANT: Ingest data in its raw form …
Open and comprehensive
Amazon S 3
Amazon Gl ac ier
AWS Gl u e
• Store the data in its raw form:
• BEFORE
• Transforming
• Analyzing
• Manipulating
• Doing … anything … to it
CSV
ORC
Grok
Avro
Parquet
JSON
• This becomes your source of record you can
always go back to …
• Lifecycle policies allow you to shift it to
warm and cold storage.

Preparing raw data for consumption
Raw data stored in Data Lake:
Preparation:
Normalized
Partitioned
Compressed
Storage Optimized
Extract – Load – Transform
Raw
Ingestion
Curated
DataSets
Data Catalog
ELT

Which tool should I use to
analyze my data?

How do I drive value?
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data Lake
on AWS
AnalyticsMachine Learning
Real-time dataTraditional movementdata movement

Different tools for different users…
Business
Reporting
Data
Catalog
Central
Storage
SagemakerMachine Learning/Deep Learning
Data Scientists
Data Engineer

Amazon Athena – interactive analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
$ SQL
Query instantly
Zero setup cost; just
point to Amazon S3
and start querying
Pay per query
Pay only for queries run;
save 30%–90% on per-
query costs through
compression
Open
ANSI SQL interface,
JDBC/ODBC drivers, multiple
formats, compression types,
and complex joins and data
types
Easy
Serverless: zero
infrastructure, zero
administration
Integrated with Amazon
QuickSight

Amazon EMR – big data processing
Analytics and ML at scale
19 open-source projects: Apache Hadoop, Spark, HBase, Presto, and more
Enterprise-grade security
$
Latest versions
Updated with the latest
open source frameworks
within 30 days of release
Low cost
Flexible billing with per-
second billing, Amazon
EC2 Spot, Reserved
Instances, and Auto
Scaling to reduce costs
50%-80%
Use Amazon S3 storage
Process data directly in
the Amazon S3 data lake
securely with high
performance using the
EMRFS connector
Easy
Launch fully managed
Hadoop & Spark in minutes;
no cluster setup, node
provisioning, cluster tuning
Data Lake
100110000100101011100
1010101110010101000
00111100101100101
010001100001

Amazon Redshift – data warehousing
Fast, powerful, simple, and fully managed data warehouse at 1/10 the cost
Massively parallel, scale from gigabytes to petabytes
Fast at scale
Columnar storage
technology to improve I/O
efficiency and scale query
performance
$
Inexpensive
As low as $1,000 per
terabyte per year, 1/10 the
cost of traditional data
warehouse solutions; start
at $0.25 per hour
Open file formats Secure
Audit everything; encrypt
data end-to-end;
extensive certification and
compliance
Analyze optimized data
formats on the latest SSD,
and all open data formats in
Amazon S3

Machine Learning & Big Data

Data Lakes driving Machine Learning
Better
Decisions
Object Storage
Databases
Data warehouse
Streaming analytics
BI
Hadoop
Spark/Presto
Elasticsearch
Better
Products Machine Learning
Deep Learning/ AI
More
Users
More
Data
Click stream
User activity
Generated content
Purchases
Clicks
Likes
Sensor data

Agility in Machine Learning
Amazon SageMaker
Amazon Rekognition
Amazon Lex
AWS DeepLens
Amazon Comprehend
Amazon Translate
Amazon Transcribe
Amazon Polly
Amazon Athena
Amazon EMR
Amazon Redshift
Amazon Kinesis
Amazon QuickSight
AWS Direct Connect
AWS Snowball
AWS Snowmobile
AWS IoT Core
Data Lake
on AWS
AnalyticsMachine Learning

Varied ML Use Cases

Broadest and deepest set of capabilities
The Amazon ML Stack

Modern data architecture for media enrichment
Insights to enhance viewer engagement, personalization, monetization

Activity Highlights and Suspicious Activity Pathing

Sentiment analysis
Discover insights and relationships in text, Social media, etc..

In Summary…
• Data lakes and data warehouses complement each
other
• Loose coupling, but highly performant
• Storage, analytics, metadata management, etc..
• Future-proof your analytics
• Choosing the best tool for the job
• Elasticity and multiple clusters for dedicated purposes
• Don’t forget metadata management

*Aberdeen: Angling for Insight in Today’s Data Lake, Michael Lock, SVP Analytics and Business Intelligence
Do you want to build
Data Lake?

Thank you!

Building Data Lakes and Analytics on AWS

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Building Data Lakes and Analytics on AWS

Similar to Building Data Lakes and Analytics on AWS (20)

More from Amazon Web Services

More from Amazon Web Services (20)

Building Data Lakes and Analytics on AWS