Using data lakes to quench your analytics fire - AWS Summit Cape Town 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Shafreen Sayyed
Solutions Architect, Amazon Web Services

Forces and Trends Prompting the Move to Cloud
Cost Optimization
Licenses
Hardware
Data center and operations
Dark Data
Prematurely discarding data
Agility
Experimentation (data & tools)
Democratised Access to Data
Time-to-first-results
Terminate failed experiments early
From BI to Data Science
In-house data science
From back office to product

Storage is the Gravity for Cloud Applications
Store all your data, for ever, at every stage of its lifecycle
Apply it using the right tool for the job

Where do I start?
Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers and insights
Start here (with a business case)

Storage is Job #1

Object Storage is Foundational

Data storage
Amazon S3
Amazon DynamoDB
Amazon Elasticsearch Service
Amazon RDS
Versioning
Lifecycle Management
5 TB Objects
Designed for 99.999999999% Durability
Replication
Reliability
Security
Scalability

Events and Lifecycle Management
Standard: active data
Standard - Infrequent Access: infrequently accessed data
Amazon Glacier: archive data
Object events: Create, Delete
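The tiering on this slide can be expressed as an S3 lifecycle configuration. The sketch below only builds the rule document in the shape that boto3's `put_bucket_lifecycle_configuration` accepts. The 30- and 90-day transitions, the 365-day expiration, the rule ID, and the `raw/` prefix are illustrative assumptions, not values from the talk.

```python
def build_lifecycle_rules(prefix="raw/", ia_after_days=30,
                          glacier_after_days=90, expire_after_days=365):
    """Return a lifecycle configuration dict (all thresholds are assumed values)."""
    if not ia_after_days < glacier_after_days < expire_after_days:
        raise ValueError("thresholds must increase: IA < Glacier < expiration")
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Standard -> Standard-IA -> Glacier, as on the slide
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


config = build_lifecycle_rules()
# With boto3 installed and credentials configured, this could be applied as:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-data-lake-bucket", LifecycleConfiguration=config)
```

Keeping the rule as plain data makes it easy to review and version alongside the rest of the data lake configuration.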

S3 as the Data Lake Fabric
• Unlimited number of objects and volume
• 99.99% availability
• 99.999999999% durability
• Versioning
• Tiered storage via lifecycle policies
• SSL, client/server-side encryption at rest
• Low cost (just over $2700/month for 100TB)
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouples storage and compute
• Run transient compute clusters (with Amazon EC2 Spot Instances)
• Multiple, heterogeneous clusters can use same data
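To make the eleven-nines durability figure concrete, here is a small back-of-the-envelope calculation. It reads the figure as an annual, per-object design target and assumes object losses are independent; both are simplifying assumptions for illustration, and the 10,000,000-object count is an arbitrary example.

```python
from fractions import Fraction

# 99.999999999% durability, taken from the slide, as an exact fraction
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count):
    """Expected objects lost per year, assuming independent per-object loss."""
    return object_count * (1 - DURABILITY)


def years_per_lost_object(object_count):
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


losses = expected_annual_losses(10_000_000)
years = years_per_lost_object(10_000_000)
print(float(losses), int(years))
```

Exact fractions are used because `1 - 0.99999999999` in floating point introduces rounding noise at this scale.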

Automated Data Ingestion
Database Migration Service

Stream Events to S3 Using Kinesis Firehose
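As a minimal sketch of the producer side, the function below serializes one event as a newline-delimited JSON record and passes it to a Firehose client through `put_record`. The stream name and event fields are hypothetical, and `RecordingClient` is a stand-in for `boto3.client("firehose")` so the example runs without AWS access.

```python
import json


def send_event(firehose_client, stream_name, event):
    """Serialize one event as a JSON line and hand it to Firehose."""
    # A trailing newline keeps records separable once Firehose batches them into S3 objects.
    payload = (json.dumps(event, sort_keys=True) + "\n").encode("utf-8")
    return firehose_client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": payload},
    )


class RecordingClient:
    """Offline stand-in for boto3.client("firehose"); it only records calls."""

    def __init__(self):
        self.calls = []

    def put_record(self, DeliveryStreamName, Record):
        self.calls.append((DeliveryStreamName, Record["Data"]))
        return {"RecordId": str(len(self.calls))}


client = RecordingClient()
response = send_event(client, "clickstream-to-s3", {"user": "u1", "action": "view"})
```

In a real deployment the delivery stream itself would be configured with the S3 destination, buffering, and optional compression.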

Write Database Changes to S3 with DMS
Full Load:
<schema_name>/<table_name>/LOAD001.csv
<schema_name>/<table_name>/LOAD002.csv
Change Data Capture:
<schema_name>/<table_name>/<time-stamp>.csv
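A downstream job often needs to tell full-load files from change-data-capture files. The sketch below classifies an object key using only the layout shown on this slide. The schema and table names and the exact timestamp format in the example are assumptions.

```python
import re

# Full-load files are named LOAD<digits>.csv in the layout shown on the slide.
FULL_LOAD = re.compile(r"^LOAD\d+\.csv$")


def classify_dms_key(key):
    """Split '<schema>/<table>/<file>' and label the file as full load or CDC."""
    parts = key.split("/")
    if len(parts) != 3 or not parts[2].endswith(".csv"):
        raise ValueError(f"unexpected key layout: {key!r}")
    schema, table, filename = parts
    # Anything that is not a LOAD file is treated as a timestamped CDC file.
    kind = "full_load" if FULL_LOAD.match(filename) else "cdc"
    return {"schema": schema, "table": table, "kind": kind}


first = classify_dms_key("sales/orders/LOAD001.csv")
change = classify_dms_key("sales/orders/20180711-101500123.csv")
```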

Data ingestion
Amazon Kinesis
• Fully-managed real-time stream processing
• Highly available across multiple AZs
• Can capture and store:
  • Terabytes of data per hour
  • From hundreds of thousands of sources
AWS IoT
• Collect data from your connected devices
• Communicate securely back to your devices
• Can easily support:
  • Billions of devices
  • Trillions of messages
“If you knew the state of every thing in the world, and could reason on top of that data, what problems could you solve?”

Data collection
AWS Direct Connect
• Dedicated 1 Gbps and 10 Gbps fibre link to AWS
• Low cost, with consistent low latency/jitter
• Direct access to AWS services and your VPCs
AWS Snowball
• Tamper-resistant case and electronics
• Ruggedized case that can withstand 8.5 G
• Available in 50 TB or 80 TB capacities
AWS Database Migration Service
• Modernise, migrate, or replicate your RDBMS
• Fan-in multiple sources to single target
• Platform and schema conversion

Storage + Catalog
Scalable (secure, versioned, durable) storage +
Immutable data at every stage of its lifecycle +
Versioned schema and metadata
=
Data discovery, lineage

AWS Glue
Data Catalog
Discover and store metadata
Job Execution
Serverless scheduling and execution
Job Authoring
Auto-generated ETL code

Glue Data Catalog
Hive metastore-compatible, highly available metadata repository:
• Classification for identifying and parsing files
• Versioning of table metadata as schemas evolve
• Table definitions – usable by Redshift, Athena, Glue, EMR
Populate using Hive DDL, bulk import, or automatically through crawlers.

Crawlers: Automatic Schema Inference
[Diagram: the crawler enumerates S3 objects (file 1, file 2, … file N), identifies the file type and parses each file using custom classifiers (app log parser, metrics parser, …) and system classifiers (JSON parser, CSV parser, Apache log parser, …), producing a semi-structured per-file schema and then a semi-structured unified schema. Field types shown: int, char, bool, array, struct.]
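The crawler slide describes two steps: derive a schema per file, then merge those into one unified schema. The sketch below imitates that idea for already-parsed JSON records. It is an illustration of the concept only, not how AWS Glue crawlers are implemented; the type names mirror the labels in the diagram, and the sample records are made up.

```python
def infer_type(value):
    """Map a parsed JSON value to a coarse type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, str):
        return "char"
    if isinstance(value, list):
        return "array"
    if isinstance(value, dict):
        return "struct"
    return "unknown"


def per_file_schema(records):
    """Schema of one file: field name -> set of observed types."""
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, set()).add(infer_type(value))
    return schema


def unify(schemas):
    """Union the per-file schemas into one schema covering every file."""
    unified = {}
    for schema in schemas:
        for field, types in schema.items():
            unified.setdefault(field, set()).update(types)
    return unified


file_1 = [{"id": 1, "tags": ["a"], "name": "x"}]
file_2 = [{"id": 2, "active": True, "meta": {"k": "v"}}]
unified = unify([per_file_schema(file_1), per_file_schema(file_2)])
```

A field that ends up with more than one observed type is a signal that files disagree and the table definition needs a decision.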

Indexing and Searching Using Metadata
[Diagram: Amazon S3 ObjectCreated and ObjectDeleted events trigger AWS Lambda, which calls PutItem on the Metadata Index (Amazon DynamoDB). The DynamoDB update stream triggers AWS Lambda to extract search fields and update the Search Index (Amazon Elasticsearch).]
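A minimal sketch of the first Lambda step follows: it turns an S3 notification event into put/delete actions for a metadata index. It returns the actions instead of calling DynamoDB so it can run offline; the item layout (`bucket`, `key`, `size`) is an assumed design. Note that S3 reports deletions with event names starting `ObjectRemoved`, which is what the diagram labels ObjectDeleted.

```python
from urllib.parse import unquote_plus


def metadata_actions(event):
    """Turn an S3 notification event into put/delete actions for a metadata index."""
    actions = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        item_key = {"bucket": bucket, "key": key}
        if record["eventName"].startswith("ObjectCreated"):
            size = record["s3"]["object"].get("size", 0)
            actions.append(("put", {**item_key, "size": size}))
        elif record["eventName"].startswith("ObjectRemoved"):
            actions.append(("delete", item_key))
    return actions


sample_event = {
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "example-lake"},
                "object": {"key": "raw/2018/my+file.csv", "size": 42}}},
        {"eventName": "ObjectRemoved:Delete",
         "s3": {"bucket": {"name": "example-lake"},
                "object": {"key": "raw/2018/old.csv"}}},
    ]
}
actions = metadata_actions(sample_event)
```

In a Lambda handler, each `put` action would map to a DynamoDB `PutItem` call and each `delete` to a `DeleteItem` call.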

Data processing and analysis
Amazon Redshift: structured data processing
• Petabyte scale data warehouse
• Fault-tolerant scalable cluster with node auto-recovery
• Auto backup into Amazon S3
Amazon EMR: semi/unstructured data processing
• Fully-managed big data platform
• Auto-scaling clusters
• Supports Hadoop:
  • Hive, Spark, Presto
  • Zeppelin, HBase, Flink
• HDFS and Amazon S3 filesystems
Amazon Athena: serverless query processing
• No infrastructure to manage
• No data loading required
• Supports multiple data formats:
  • CSV, TSV, Avro, ORC, Parquet
• Uses ANSI SQL to directly query Amazon S3

Consume and visualise data
AWS Lambda: compute platforms
• No infrastructure to manage
• Event-driven processing
• Pay per 100 ms CPU
• Node.js, Python, Java and C# (.NET Core)
Amazon Machine Learning: machine learning
• No infrastructure to manage
• Multiple classifier types
• Interactive UI for modelling and dataset visualisation
Amazon QuickSight: business intelligence
• No infrastructure to manage
• Fast, cloud-powered BI tool
• Scales to hundreds of thousands of users
• Quick calculations with SPICE

Amazon Redshift
Relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
a lot faster
a lot simpler
a lot cheaper

Amazon Redshift has security built in
SSL to secure data in transit
Encryption to secure data at rest
• AES-256; hardware accelerated
• All blocks on disks and in Amazon S3 encrypted
• HSM support
No direct access to compute nodes
Audit logging, AWS CloudTrail, AWS KMS integration
Amazon VPC support
SOC 1/2/3, PCI-DSS Level 1, FedRAMP, HIPAA
[Architecture diagram. Labels: SQL Clients/BI Tools; JDBC/ODBC; Customer VPC; Leader Node; Internal VPC; 10 GigE (HPC); Compute Nodes (128 GB RAM, 16 TB disk, 16 cores each); Ingestion, Backup, Restore via Amazon S3/Amazon DynamoDB]

Redshift Spectrum
Leverages Amazon Redshift’s advanced cost-based optimizer
Pushes down projections, filters, aggregations and join reduction
Dynamic partition pruning to minimize data processed
Automatic parallelization of query execution against Amazon S3 data
Efficient join processing within the Amazon Redshift cluster
[Architecture diagram. Labels: SQL Clients/BI Tools; JDBC/ODBC; Customer VPC; Leader Node; Internal VPC; 10 GigE (HPC); Compute Nodes (128 GB RAM, 16 TB disk, 16 cores each); Ingestion, Backup, Restore; Redshift Nodes; Spectrum Nodes (1 2 3 4 ... N); Data Catalog / Apache Hive Metastore; Amazon S3 (exabyte-scale object storage)]
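Partition pruning, mentioned on this slide, means skipping whole partitions whose key values cannot match the query filter, so less S3 data is scanned. The sketch below shows the general idea as a filter over partition metadata. It is a conceptual illustration, not Redshift Spectrum's implementation, and the bucket name and `year=/month=` layout are assumptions.

```python
def prune_partitions(partitions, predicate):
    """Keep only partitions whose key values satisfy the predicate."""
    return [p for p in partitions if predicate(p["values"])]


# Twelve hypothetical monthly partitions of an events table.
partitions = [
    {
        "values": {"year": 2018, "month": m},
        "location": f"s3://example-lake/events/year=2018/month={m:02d}/",
    }
    for m in range(1, 13)
]

# A filter such as "month >= 11" only needs to read two of the twelve prefixes.
selected = prune_partitions(partitions, lambda v: v["month"] >= 11)
```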

Amazon Redshift works with third-party analysis tools
[Diagram: analysis tools connect to Amazon Redshift over JDBC/ODBC; Amazon QuickSight is marked "New!"]

Security is Job #0

Data Access & Authorisation
Give your users easy and secure access
Storage & Catalog
Secure, cost-effective storage in Amazon S3. Robust metadata in AWS Catalog
Protect and Secure
Use entitlements to ensure data is secure and users’ identities are verified

Identity and Access Management
• Manage users, groups, and roles
• Identity federation with OpenID
• Temporary credentials with AWS Security Token Service (AWS STS)
• Stored policy templates
• Powerful policy language
• Amazon S3 bucket policies
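As an illustration of the policy language and bucket policies listed above, the sketch below builds a bucket policy document that grants one IAM role read-only access to a single prefix. The bucket name, prefix, account ID, and role name are all hypothetical.

```python
import json


def read_only_prefix_policy(bucket, prefix, role_arn):
    """Bucket policy allowing one role to list and read objects under a prefix."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                # Restrict listing to keys under the prefix
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadPrefix",
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = read_only_prefix_policy(
    "example-lake",                                  # hypothetical bucket
    "curated/",                                      # hypothetical prefix
    "arn:aws:iam::123456789012:role/analyst",        # hypothetical role
)
policy_json = json.dumps(policy, indent=2)
```

The JSON string is what would be attached to the bucket; granting only `s3:GetObject` and `s3:ListBucket` keeps the role from modifying or deleting data.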

Security at the Data Level
IAM: Service API Access
Amazon S3
Amazon ElastiCache
Amazon DynamoDB
Amazon EMR
Amazon Kinesis
Amazon Athena

Storage Level Support for Access Logging and Audit
Third Party Ecosystem Security Tools
Access Logging: Amazon S3
API Logging: AWS CloudTrail
Access Log Analytics: Amazon Athena, Amazon EMR
IAM
http://amzn.to/2tSimHj
http://amzn.to/2si6RqS

Encryption Options
AWS Server-Side encryption
• AWS managed key infrastructure
AWS Key Management Service
• Automated key rotation & auditing
• Integration with other AWS services
AWS CloudHSM
• Dedicated Tenancy SafeNet Luna SA HSM Device
• Common Criteria EAL4+, NIST FIPS 140-2
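For the first two options, server-side encryption is requested per object through extra arguments on the upload call. The sketch below builds those arguments for S3-managed keys or a KMS key; the key alias in the example is hypothetical, and the upload call is shown only as a comment.

```python
def encryption_args(kms_key_id=None):
    """Extra put_object arguments for server-side encryption."""
    if kms_key_id is None:
        # S3-managed keys (SSE-S3)
        return {"ServerSideEncryption": "AES256"}
    # AWS KMS-managed key (SSE-KMS)
    return {"ServerSideEncryption": "aws:kms", "SSEKMSKeyId": kms_key_id}


sse_s3 = encryption_args()
sse_kms = encryption_args("alias/example-data-lake")  # hypothetical key alias
# With boto3 installed and credentials configured:
# boto3.client("s3").put_object(Bucket="example-lake", Key="raw/a.csv",
#                               Body=b"...", **sse_kms)
```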

Serverless Processing and Analytics

Managed ETL with AWS Glue
• Python code generated by AWS Glue
• Connect a notebook or IDE to AWS Glue
• Existing code brought into AWS Glue

Job Execution with AWS Glue
• Schedule-based
• Event-based
• On demand

Amazon Athena – Analyze Data in S3
• Interactive queries
• ANSI SQL
• No infrastructure or administration
• Zero spin up time
• Query data in its raw format
• AVRO, Text, CSV, JSON, weblogs, AWS service logs
• Convert to an optimized form like ORC or Parquet for the best performance and lowest cost
• No loading of data, no ETL required
• Stream data directly from Amazon S3, take advantage of Amazon S3 durability and availability
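Queries can also be submitted programmatically. The sketch below wraps the Athena `start_query_execution` call; the database, table, and result location are hypothetical, and `RecordingAthena` stands in for `boto3.client("athena")` so the example runs without AWS access.

```python
def run_query(athena_client, sql, database, output_location):
    """Submit a query to Athena and return its execution ID."""
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


class RecordingAthena:
    """Offline stand-in for boto3.client("athena"); it only records the request."""

    def __init__(self):
        self.requests = []

    def start_query_execution(self, **kwargs):
        self.requests.append(kwargs)
        return {"QueryExecutionId": f"query-{len(self.requests)}"}


athena = RecordingAthena()
query_id = run_query(
    athena,
    "SELECT status, count(*) FROM weblogs GROUP BY status",  # hypothetical table
    "example_lake_db",
    "s3://example-lake-query-results/",
)
```

Submission is asynchronous: a real client would poll `get_query_execution` with the returned ID until the query finishes.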

Simple query editor with syntax highlighting and autocomplete
Data Catalog
Query History, Saved Queries, and Catalog Management

Using Amazon Athena with Amazon QuickSight
QuickSight allows you to connect to data from a wide variety of AWS, third-party, and on-premises sources including Amazon Athena
Amazon RDS
Amazon S3
Amazon Redshift
Amazon Athena

Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms

Amazon Kinesis Analytics – Simple SQL Interface
SELECT STREAM author,
count(author) OVER ONE_MINUTE
FROM Tweets
WHERE text LIKE '%#BigDataCapeTown%'
WINDOW ONE_MINUTE AS
(PARTITION BY author
RANGE INTERVAL '1' MINUTE PRECEDING);

Again, where do I start?
Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers and insights
Seriously, start here (with a business case)
Then collect your data

Security & Governance: IAM, Amazon CloudWatch, AWS CloudTrail, AWS KMS, AWS CloudHSM, AWS Directory Service, Amazon Cognito
Data Catalog: Amazon Athena Catalog, Hive Metastore (EMR, RDS), Glue Catalog

Summary

Build decoupled systems
• Use Amazon S3 as the data fabric of your data lake
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use log-centric design patterns
• Immutable log, batch, interactive & real-time views
Be cost-conscious
• Big data ≠ big cost

Thank you!
