Data engineering

Suman Debnath
Principal Developer Advocate, India
Data Engineering with AWS

The volume of data being produced is increasing
• The number of “smart” devices is projected to be 200 billion by 2020 (over 100X increase in ten years)
• 90% of the data in the world was generated in the last 2 years
• There are 2.5 quintillion bytes of data created each day, and this pace is accelerating

Types of Data
• Structured
- Relational Data
• Unstructured
- Documents, Media files, PDF, etc.
• Semi-structured
- JSON, XML, etc.

Data Lake
A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale.

Data Lake
• All data in one place, a single source of truth
• Supports different formats
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
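
The "schema on read" bullet above can be illustrated with a small sketch. It assumes raw reviews are stored as JSON lines and that the reader, not the writer, decides which columns and types matter. The sample records, the `READ_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names made up for this illustration.

```python
import json

# Raw records land in the lake exactly as produced; nothing is enforced on write.
RAW_LINES = [
    '{"review_id": "r1", "stars": "5", "text": "Great product"}',
    '{"review_id": "r2", "stars": 3}',
    '{"review_id": "r3", "stars": "4", "verified": true}',
]

# Hypothetical schema chosen by the reader at query time: column name -> caster.
READ_SCHEMA = {"review_id": str, "stars": int, "text": str}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema, casting as we go."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for column, cast in schema.items():
            value = record.get(column)
            # Missing fields become None instead of failing the whole read.
            row[column] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LINES, READ_SCHEMA)
for row in rows:
    print(row)
```

A different consumer could read the same files with a different mapping (for example one that includes `verified`) without rewriting the stored data.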

Simplified Data Pipeline
[Diagram: pipeline stages Data Sources, Ingest, Store (Amazon S3), Catalog, Process & Analyze, Consume]

Multiple Data Sources
[Diagram: the pipeline with its data sources filled in: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Remaining stages: Ingest, Store (Amazon S3), Catalog, Process & Analyze, Consume]

Amazon DynamoDB
• Fully managed, multi-region, multi-master database
• Nonrelational database that delivers reliable performance at any scale
• Consistent single-digit millisecond latency
• Built-in security, backup and restore, and in-memory caching
• Supports streams

Ingestion Options
[Diagram: the pipeline with the Ingest stage filled in: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Remaining stages: Store (Amazon S3), Catalog, Process & Analyze, Consume]

Amazon Kinesis
• Kinesis is a managed alternative to Apache Kafka
• Application logs, metrics, IoT, clickstreams
• “Real time” big data
• Stream processing frameworks (Spark, NiFi, etc.)
• Data is automatically replicated synchronously to 3 Availability Zones

Amazon Kinesis
• Amazon Kinesis Data Streams
• Amazon Kinesis Data Firehose
• Amazon Kinesis Data Analytics
• Amazon Kinesis Video Streams

Amazon Kinesis
[Diagram: Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics, and Amazon Kinesis Data Firehose, shown alongside Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service]

Kinesis Stream Shard
• One stream is made of many different shards
• Billing is per shard provisioned; a stream can have as many shards as you want
• Records can be sent in batches or with per-message calls
• The number of shards can evolve over time (reshard / merge)
• Records are ordered per shard
[Diagram: Producer, Shard 1, Shard 2, ..., Shard n, Consumer]
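
The shard bullets above can be made concrete with a local simulation. Kinesis Data Streams maps each record's partition key to a 128-bit integer with MD5 and routes the record to the shard whose hash key range contains that value. The rest of this sketch is assumption: the evenly split ranges, the in-memory lists standing in for shards, and the `shard_ranges` and `shard_for` helpers are hypothetical and do not call the service.

```python
import hashlib
from collections import defaultdict

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_ranges(shard_count):
    """Split the hash key space into contiguous, evenly sized ranges."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last shard absorbs any remainder so the space is fully covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's hash."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("hash fell outside every range")


ranges = shard_ranges(4)
shards = defaultdict(list)
events = [("device-a", 1), ("device-b", 1), ("device-a", 2),
          ("device-c", 1), ("device-a", 3)]
for key, sequence in events:
    # Appending preserves arrival order within each simulated shard.
    shards[shard_for(key, ranges)].append((key, sequence))

for index in sorted(shards):
    print(index, shards[index])
```

Because a given partition key always hashes to the same value, all records for `device-a` land in one shard and keep their order there. Ordering across different shards is not guaranteed, which is why the choice of partition key matters.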

Storage Layer
[Diagram: the pipeline with the Store stage highlighted: Amazon S3. Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Remaining stages: Catalog, Process & Analyze, Consume]

Amazon S3 is the Base
• Secure, highly scalable, durable object storage with millisecond latency for data access
• Store any type of data, from web sites, mobile apps, corporate applications, and IoT sensors, at any scale
• Store data in the format you want: unstructured (logs, dump files) | semi-structured (JSON, XML) | structured (CSV, Parquet)
• Storage lifecycle integration: Amazon S3-Standard | Amazon S3-Infrequent Access | Amazon Glacier
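
The storage lifecycle bullet above could be expressed as a lifecycle configuration. This is a minimal sketch: it only builds the dictionary in the shape that the S3 `PutBucketLifecycleConfiguration` API accepts and does not contact AWS. The `lifecycle_rule` helper, the `raw/logs/` prefix, the day counts, and the bucket name in the comment are assumptions chosen for illustration.

```python
import json


def lifecycle_rule(prefix, days_to_ia, days_to_glacier):
    """Build one lifecycle rule that tiers objects under `prefix` over time."""
    if days_to_ia >= days_to_glacier:
        raise ValueError("objects must move to Infrequent Access before Glacier")
    return {
        "ID": f"tier-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": days_to_ia, "StorageClass": "STANDARD_IA"},
            {"Days": days_to_glacier, "StorageClass": "GLACIER"},
        ],
    }


config = {"Rules": [lifecycle_rule("raw/logs/", 30, 90)]}
print(json.dumps(config, indent=2))

# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-data-lake",  # hypothetical bucket name
#       LifecycleConfiguration=config,
#   )
```

Keeping the rule as plain data makes it easy to review and version alongside the rest of the pipeline definition.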

Data Discovery and Catalog
[Diagram: the pipeline with the Catalog stage filled in: AWS Glue. Store: Amazon S3. Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Remaining stages: Process & Analyze, Consume]

AWS Glue: Serverless Data Catalog and ETL
• Automatically discovers data and stores schema
• Makes data searchable and available for ETL
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
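
To give a feel for what "automatically discovers data and stores schema" means, here is a toy schema inference over a CSV sample. It is not how a Glue crawler is implemented. The sample data, the `infer_type` and `discover_schema` helpers, and the choice of `bigint`, `double`, and `string` as type names are assumptions made for this sketch.

```python
import csv
import io

SAMPLE_CSV = """review_id,stars,price,title
r1,5,19.99,Great product
r2,3,5.00,Okay
r3,4,12.50,Solid value
"""


def infer_type(values):
    """Pick the narrowest type that fits every value: bigint, double, or string."""
    for type_name, cast in (("bigint", int), ("double", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def discover_schema(text):
    """Read a CSV sample and return an ordered column -> type mapping."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = rows[0].keys() if rows else []
    return {column: infer_type([row[column] for row in rows]) for column in columns}


schema = discover_schema(SAMPLE_CSV)
print(schema)
```

A catalog would store a mapping like this next to the data's location so that query engines can read the files without each user redefining the columns.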

Process and Analyze
[Diagram: the pipeline with the Process & Analyze stage filled in: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch. Catalog: AWS Glue. Store: Amazon S3. Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Remaining stage: Consume]

Amazon Athena: Interactive Analysis
• Interactive query service to analyze data in Amazon S3 using standard SQL
• No infrastructure to set up or manage and no data to load
• Supports multiple data formats; define schema on demand
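
Querying S3 data "using standard SQL" might look like the following sketch. It only assembles the arguments in the shape that Athena's `StartQueryExecution` API expects and makes no network call. The `build_athena_request` helper and the database, table, column, and bucket names are hypothetical.

```python
def build_athena_request(database, table, min_stars, output_location):
    """Return keyword arguments shaped for Athena's StartQueryExecution call."""
    if not isinstance(min_stars, int):
        raise TypeError("min_stars must be an int so it is safe to inline")
    query = (
        f'SELECT review_id, stars FROM "{table}" '
        f"WHERE stars >= {min_stars} ORDER BY stars DESC LIMIT 10"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


request = build_athena_request(
    database="reviews_db",                            # hypothetical catalog database
    table="product_reviews",                          # hypothetical table
    min_stars=4,
    output_location="s3://example-athena-results/",   # hypothetical bucket
)
print(request["QueryString"])

# With boto3 installed and credentials configured:
#   boto3.client("athena").start_query_execution(**request)
```

The query runs against files already in S3, so there is no load step. Results are written to the output location and can be fetched once the execution finishes.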

Querying the Data Lake
[Diagram: the full pipeline. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Ingest: Amazon Kinesis, Database Migration Service, AWS Snowball, Amazon MSK. Store: Amazon S3. Catalog: AWS Glue. Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch. Consume: BI tools, Jupyter Notebooks, Amazon API Gateway, Amazon QuickSight]

Demo Please
(Amazon Product Review)
[Diagram: Amazon Kinesis Data Firehose, Amazon S3, AWS Glue, Amazon Athena, AWS Lambda, Amazon Comprehend]
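
The slides do not say how the demo services are wired together. One plausible arrangement, sketched here purely as an assumption, is a Lambda function used as a Kinesis Data Firehose transformation: it decodes each review, attaches a sentiment label, and hands the record back for delivery to S3. The record envelope (`recordId`, `result`, base64 `data`) follows the Firehose transformation contract. The `fake_sentiment` function is a hypothetical stand-in for a call to Amazon Comprehend, and the `text` field name is made up.

```python
import base64
import json


def fake_sentiment(text):
    """Hypothetical stand-in for a sentiment call such as Amazon Comprehend."""
    return "POSITIVE" if "great" in text.lower() else "NEUTRAL"


def handler(event, context=None, detect_sentiment=fake_sentiment):
    """Firehose-style transform: decode each record, enrich it, re-encode it."""
    output = []
    for record in event["records"]:
        try:
            review = json.loads(base64.b64decode(record["data"]))
            review["sentiment"] = detect_sentiment(review["text"])
            # Newline-delimited JSON keeps the delivered objects easy to query.
            payload = (json.dumps(review) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload).decode("utf-8"),
            })
        except (ValueError, KeyError):
            # Hand the record back untouched and flag it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}


def encode(obj):
    """Helper for the local example: JSON-encode then base64-encode a dict."""
    return base64.b64encode(json.dumps(obj).encode("utf-8")).decode("utf-8")


event = {"records": [
    {"recordId": "1", "data": encode({"text": "Great headphones"})},
    {"recordId": "2", "data": encode({"stars": 2})},
]}
result = handler(event)
for item in result["records"]:
    print(item["recordId"], item["result"])
```

Passing the sentiment function in as a parameter keeps the handler testable locally. In a deployed function the default could be replaced with a real client call.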

Thank You
Suman Debnath
Principal Developer Advocate, India
ml.aws
