2. The volume of data being produced is increasing
• The number of “smart” devices is projected to be 200 billion by 2020 (over 100X increase in ten years)
• 90% of the data in the world was generated in the last 2 years
• There are 2.5 quintillion bytes of data created each day, and this pace is accelerating
3. Types of Data
• Structured
- Relational Data
• Unstructured
- Documents, Media files, PDF, etc.
• Semi-structured
- JSON, XML, etc.
4. Data Lake
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.
5. Data Lake
• All data in one place, a single source of truth
• Supports different formats
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
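As a rough illustration of "schema on read", the sketch below stores raw JSON lines untouched and applies a schema only when a reader asks for the data. The event fields, `PURCHASE_SCHEMA`, and `read_with_schema` are hypothetical names invented for this example; they are not part of any AWS API.

```python
import json

# Raw events are kept exactly as they arrived; nothing validates them at write time.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "ts": "2020-01-01T10:00:00"}',
    '{"user": "bob", "amount": 7, "device": "mobile"}',
]

# Hypothetical schema picked by one consumer: field name -> type to cast to.
PURCHASE_SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project them onto `schema` at read time."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None instead of failing the whole read.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA)
print(rows)
```

A second consumer could read the same raw lines with a different schema (for example one that keeps `device`), which is the practical benefit of deferring the schema to read time.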
7. Multiple Data Sources
[Diagram: data sources (Amazon DynamoDB, web logs / cookies, ERP, connected devices) flow through the stages Ingest, Catalog and Store (Amazon S3), Process & Analyze, and Consume.]
8. Amazon DynamoDB
• Fully managed, multi-region, multi-master database
• Nonrelational database that delivers reliable performance at any scale
• Consistent single-digit millisecond latency
• Built-in security, backup and restore, and in-memory caching
• Supports streams
9. Ingestion Options
[Diagram: the same pipeline with the Ingest stage filled in by Amazon Kinesis, AWS Snowball, Amazon MSK, and Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Store: Amazon S3. Remaining stages: Catalog, Process & Analyze, Consume.]
10. Amazon Kinesis
• Kinesis is a managed alternative to Apache Kafka
• Suited to application logs, metrics, IoT, and clickstreams
• “Real-time” big data
• Works with stream processing frameworks (Spark, NiFi, etc.)
• Data is automatically replicated synchronously to 3 AZs
12. Amazon Kinesis
[Diagram: Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics, and Amazon Kinesis Data Firehose, shown alongside Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.]
13. Kinesis Stream Shard
• One stream is made of many different shards
• Billing is per shard provisioned; a stream can have as many shards as you want
• Records can be sent in batches or one message per call
• The number of shards can change over time (reshard / merge)
• Records are ordered per shard
[Diagram: Producer → Shard 1, Shard 2, ... Shard n → Consumer]
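To make the per-shard ordering point concrete, here is a small local simulation of how records could be routed to shards. Kinesis maps each record's partition key to a 128-bit integer with MD5 and assigns the record to the shard whose hash key range contains it. The even split of the hash space, the big-endian reading of the digest, and the names `shard_for_key` and `route` are assumptions made for this sketch; it does not call AWS.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 digests are 128 bits wide


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumes the hash space is split evenly across `shard_count` shards.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # Clamp so any leftover from integer division lands in the last shard.
    return min(hash_key // range_size, shard_count - 1)


def route(records, shard_count):
    """Append each (partition_key, payload) record to its shard, in arrival order."""
    shards = {i: [] for i in range(shard_count)}
    for key, payload in records:
        shards[shard_for_key(key, shard_count)].append((key, payload))
    return shards


records = [
    ("device-1", "t=1"),
    ("device-2", "t=1"),
    ("device-1", "t=2"),
    ("device-1", "t=3"),
]
shards = route(records, shard_count=4)
for index, contents in shards.items():
    print(index, contents)
```

Because a given partition key always hashes to the same shard, all records for `device-1` stay in arrival order relative to each other, while no ordering is guaranteed across different shards.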
15. Amazon S3 is the Base
• Secure, highly scalable, durable object storage with millisecond latency for data access
• Store any type of data, from web sites, mobile apps, corporate applications, and IoT sensors, at any scale
• Store data in the format you want: unstructured (logs, dump files) | semi-structured (JSON, XML) | structured (CSV, Parquet)
• Storage lifecycle integration: Amazon S3-Standard | Amazon S3-Infrequent Access | Amazon Glacier
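The storage lifecycle tiers can be sketched as a lifecycle rule. The dictionary below follows the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` parameter; the rule ID, the `raw/` prefix, and the 30 and 90 day thresholds are illustrative assumptions, and `storage_class_at` is a hypothetical local helper, not an S3 call.

```python
# Illustrative lifecycle configuration (values are assumptions for this example).
LIFECYCLE = {
    "Rules": [
        {
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},  # S3-Infrequent Access
                {"Days": 90, "StorageClass": "GLACIER"},      # Amazon Glacier
            ],
        }
    ]
}


def storage_class_at(age_days, rule):
    """Local helper: which storage class an object of this age would be in."""
    current = "STANDARD"
    for transition in sorted(rule["Transitions"], key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


rule = LIFECYCLE["Rules"][0]
for age in (10, 45, 365):
    print(age, storage_class_at(age, rule))

# Applying the rule to a real bucket would look roughly like this
# (requires boto3 and AWS credentials; the bucket name is a placeholder):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket", LifecycleConfiguration=LIFECYCLE)
```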
16. Data Discovery and Catalog
[Diagram: the same pipeline with AWS Glue as the Catalog and Amazon S3 as the Store. Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Remaining stages: Process & Analyze, Consume.]
17. AWS Glue – Serverless Data Catalog and ETL
• Automatically discovers data and stores the schema
• Makes data searchable and available for ETL
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
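The idea of automatically discovering data and storing its schema can be pictured with a toy schema inference over sample records. This is only a conceptual sketch in plain Python; it is not how AWS Glue crawlers or classifiers are implemented, and `infer_schema` and the sample records are invented for the example.

```python
import json


def infer_schema(json_lines):
    """Collect, per field, the set of Python type names seen in the sample."""
    schema = {}
    for line in json_lines:
        for field, value in json.loads(line).items():
            schema.setdefault(field, set()).add(type(value).__name__)
    # Sort the type names so the result is deterministic.
    return {field: sorted(types) for field, types in schema.items()}


sample = [
    '{"id": 1, "name": "a"}',
    '{"id": 2, "price": 9.5}',
]
catalog_entry = infer_schema(sample)
print(catalog_entry)
```

A catalog would keep an entry like this next to the data's location so that query and ETL tools can find the dataset and its columns without reading every file.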
19. Amazon Athena – Interactive Analysis
• Interactive query service to analyze data in Amazon S3 using standard SQL
• No infrastructure to set up or manage and no data to load
• Supports multiple data formats; define schema on demand
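A query against data in S3 might be submitted along the following lines. The request keys match boto3's Athena `start_query_execution` call; the database name, table name, and results bucket are placeholders assumed for this sketch, and `build_athena_request` and `run_query` are hypothetical helper names.

```python
def build_athena_request(sql, database, output_location):
    """Assemble keyword arguments for the Athena start_query_execution call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query; needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the sketch loads without the AWS SDK

    client = boto3.client("athena")
    return client.start_query_execution(**request)["QueryExecutionId"]


request = build_athena_request(
    sql="SELECT user, SUM(amount) AS total FROM purchases GROUP BY user",
    database="datalake_demo",                         # placeholder database
    output_location="s3://example-athena-results/",   # placeholder bucket
)
print(request)
```

Athena runs queries asynchronously, so a real caller would poll `get_query_execution` with the returned ID until the query finishes and then fetch rows with `get_query_results`.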
20. Querying the Data Lake
[Diagram: the full pipeline. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Ingest: Amazon Kinesis, Database Migration Service, AWS Snowball, Amazon MSK. Catalog: AWS Glue. Store: Amazon S3. Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch. Consume: BI tools, Jupyter notebooks, Amazon API Gateway, Amazon QuickSight.]