Suman Debnath
Principal Developer Advocate, India
Data Engineering with AWS
The volume of data being produced is increasing
• The number of “smart” devices is projected to be 200 billion by 2020 (over 100X increase in ten years)
• 90% of the data in the world was generated in the last 2 years
• There are 2.5 quintillion bytes of data created each day, and this pace is accelerating
Types of Data
• Structured
- Relational Data
• Unstructured
- Documents, Media files, PDF, etc.
• Semi-structured
- JSON, XML, etc.
Data Lake
A data lake is a centralized repository that allows you to
store all your structured and unstructured data at any
scale
Data Lake
• All data in one place, a single source of truth
• Supports different formats
• Supports fast ingestion and consumption
• Schema on read
• Designed for low-cost storage
• Decouples storage and compute
• Supports protection and security rules
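To make "schema on read" concrete, here is a minimal sketch in plain Python. The records are stored as raw JSON lines with no schema enforced on write, and each reader applies its own schema when it reads. The field names and the `ORDER_SCHEMA` mapping are hypothetical, chosen only for illustration.

```python
import io
import json

# Raw records as they might land in the lake: stored untouched, no schema on write.
RAW_EVENTS = io.StringIO(
    '{"user": "a1", "amount": "19.99", "ts": "2020-01-05"}\n'
    '{"user": "b2", "amount": "5", "country": "IN"}\n'
)

# Hypothetical schema chosen by one reader: field name -> converter.
ORDER_SCHEMA = {"user": str, "amount": float}


def read_with_schema(lines, schema):
    """Apply a schema while reading; fields outside the schema are ignored."""
    for line in lines:
        record = json.loads(line)
        yield {name: cast(record[name]) for name, cast in schema.items() if name in record}


rows = list(read_with_schema(RAW_EVENTS, ORDER_SCHEMA))
total = sum(row["amount"] for row in rows)
print(rows, total)
```

A different consumer could read the same raw lines with a different schema (for example, one that keeps `country`), which is the flexibility the bullet refers to.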
Simplified Data Pipeline
[Diagram: pipeline stages Data Sources, Ingest, Store (Amazon S3), Catalog, Process & Analyze, Consume]
Multiple Data Sources
[Diagram: the same pipeline with the data sources filled in: Amazon DynamoDB, web logs / cookies, ERP, connected devices]
Amazon DynamoDB
• Fully managed, multi-region, multi-master database
• Nonrelational database that delivers reliable performance at any scale
• Consistent single-digit millisecond latency
• Built-in security, backup and restore, and in-memory caching
• Supports streams
Ingestion Options
[Diagram: the pipeline with the Ingest stage filled in: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Store: Amazon S3.]
Amazon Kinesis
• Kinesis is a managed alternative to Apache Kafka
• Suited to application logs, metrics, IoT, and clickstreams
• “Real time” big data
• Works with stream processing frameworks (Spark, NiFi, etc.)
• Data is automatically replicated synchronously to 3 AZs
Amazon Kinesis
• Amazon Kinesis Data Streams
• Amazon Kinesis Data Firehose
• Amazon Kinesis Data Analytics
• Amazon Kinesis Video Streams
Amazon Kinesis
[Diagram: Amazon Kinesis Data Streams, Amazon Kinesis Data Analytics, and Amazon Kinesis Data Firehose, with Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service as destinations]
Kinesis Stream Shard
• One stream is made of many different shards
• Billing is per shard provisioned; you can have as many shards as you want
• Producers can batch records or send one record per call
• The number of shards can evolve over time (reshard / merge)
• Records are ordered per shard
[Diagram: Producer, Shard 1, Shard 2, ..., Shard n, Consumer]
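The per-shard ordering above follows from how records are routed: Kinesis Data Streams hashes each record's partition key with MD5 into a 128-bit integer and places the record in the shard whose hash key range contains it. The sketch below imitates that routing locally. The equal-sized ranges, the device names, and the payloads are assumptions for illustration; real shards can have uneven ranges after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_ranges(shard_count):
    """Split the hash key space into contiguous, equal-sized ranges (an assumption)."""
    size = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        end = HASH_SPACE - 1 if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside every range")


ranges = shard_ranges(3)
events = [("device-1", "t=1"), ("device-2", "t=1"), ("device-1", "t=2"), ("device-1", "t=3")]

# Append each record to its shard in arrival order.
shards = {i: [] for i in range(len(ranges))}
for key, payload in events:
    shards[shard_for(key, ranges)].append((key, payload))

for index, records in shards.items():
    print(index, records)
```

Because every record with the same partition key lands in the same shard, records for `device-1` keep their arrival order, while no ordering is implied across shards.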
Storage Layer
[Diagram: the pipeline with the Store stage filled in: Amazon S3. Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices.]
Amazon S3 is the Base
• Secure, highly scalable, durable object storage with millisecond latency for data access
• Store any type of data (web sites, mobile apps, corporate applications, IoT sensors) at any scale
• Store data in the format you want: unstructured (logs, dump files) | semi-structured (JSON, XML) | structured (CSV, Parquet)
• Storage lifecycle integration: Amazon S3-Standard | Amazon S3-Infrequent Access | Amazon Glacier
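As a rough illustration of storage lifecycle integration, the sketch below builds a lifecycle configuration in the shape that boto3's `put_bucket_lifecycle_configuration` accepts, moving objects from S3 Standard to Infrequent Access and then to Glacier. The prefix `raw/logs/`, the day counts, and the helper name `lifecycle_rule` are hypothetical; the code only builds and prints the structure and does not call AWS.

```python
import json


def lifecycle_rule(prefix, ia_after_days, glacier_after_days):
    """Build one rule that tiers objects under `prefix` to cheaper storage over time."""
    if not 0 < ia_after_days < glacier_after_days:
        raise ValueError("expected 0 < ia_after_days < glacier_after_days")
    return {
        "ID": f"tier-{prefix.strip('/').replace('/', '-')}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
        ],
    }


# Hypothetical policy: Infrequent Access after 30 days, Glacier after 90.
lifecycle_configuration = {"Rules": [lifecycle_rule("raw/logs/", 30, 90)]}
print(json.dumps(lifecycle_configuration, indent=2))

# With credentials configured, this could be applied roughly as follows (not run here):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake-bucket",  # hypothetical bucket name
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```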
Data Discovery and Catalog
[Diagram: the pipeline with the Catalog stage filled in: AWS Glue. Store: Amazon S3. Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices.]
AWS Glue – Serverless Data Catalog and ETL
• Automatically discovers data and stores schema
• Makes data searchable and available for ETL
• Generates customizable code
• Schedules and runs your ETL jobs
• Serverless
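To give a feel for what "discovers data and stores schema" means, here is a toy schema inference over sample JSON records. It is not how AWS Glue crawlers are implemented; the sample records, the type names, and the widening rules are assumptions made for this sketch.

```python
import json

# Hypothetical sample of records found in the lake.
SAMPLE = [
    '{"review_id": 1, "stars": 5, "text": "great"}',
    '{"review_id": 2, "stars": 3.5, "verified": true}',
]

# Map Python types to catalog-style type names (an illustrative choice).
TYPE_NAMES = {bool: "boolean", int: "bigint", float: "double", str: "string"}


def infer_schema(lines):
    """Infer one column type per field, widening when records disagree."""
    schema = {}
    for line in lines:
        for name, value in json.loads(line).items():
            observed = TYPE_NAMES.get(type(value), "string")
            previous = schema.get(name)
            if previous is None or previous == observed:
                schema[name] = observed
            elif {previous, observed} == {"bigint", "double"}:
                schema[name] = "double"  # integers and floats widen to double
            else:
                schema[name] = "string"  # anything else falls back to string
    return schema


catalog_entry = infer_schema(SAMPLE)
print(catalog_entry)
```

The resulting mapping plays the role of a catalog entry: once the column names and types are recorded, query engines and ETL jobs can find and read the data without re-scanning it.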
Process and Analyze
[Diagram: the pipeline with the Process & Analyze stage filled in: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch. Catalog: AWS Glue. Store: Amazon S3. Ingest: Amazon Kinesis, AWS Snowball, Amazon MSK, Database Migration Service. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices.]
Amazon Athena – Interactive Analysis
• Interactive query service to analyze data in Amazon S3 using standard SQL
• No infrastructure to set up or manage and no data to load
• Supports multiple data formats; define schema on demand
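As a sketch of what querying S3 data with standard SQL can look like from code, the helper below assembles the parameters that boto3's Athena `start_query_execution` call takes. The database `reviews_db`, the table `product_reviews`, the column `stars`, and the results bucket are all hypothetical, and the snippet only builds the request rather than contacting AWS.

```python
def athena_query_request(database, sql, output_location):
    """Assemble keyword arguments for Athena's start_query_execution."""
    if not output_location.startswith("s3://"):
        raise ValueError("Athena writes query results to an S3 location")
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical query over a table registered in the data catalog.
SQL = (
    "SELECT stars, COUNT(*) AS reviews "
    "FROM product_reviews "
    "GROUP BY stars "
    "ORDER BY stars"
)

request = athena_query_request("reviews_db", SQL, "s3://my-athena-results/demo/")
print(request)

# With credentials configured, the query could be submitted like this (not run here):
#   import boto3
#   boto3.client("athena").start_query_execution(**request)
```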
Querying the Data Lake
[Diagram: the full pipeline. Data sources: Amazon DynamoDB, web logs / cookies, ERP, connected devices. Ingest: Amazon Kinesis, Database Migration Service, AWS Snowball, Amazon MSK. Store: Amazon S3. Catalog: AWS Glue. Process & Analyze: Amazon Athena, Amazon EMR, Amazon Redshift, Amazon Elasticsearch. Consume: BI tools, Jupyter Notebooks, Amazon API Gateway, Amazon QuickSight.]
Demo Please (Amazon Product Review)
[Diagram: Amazon Kinesis Data Firehose, AWS Lambda, Amazon Comprehend, Amazon S3, AWS Glue, Amazon Athena]
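The demo slide lists the services without showing how they connect. One plausible wiring, an assumption rather than the presenter's actual setup, is a Lambda function that Kinesis Data Firehose invokes to enrich each review with a sentiment label before delivery to Amazon S3. The sketch below follows the Firehose transformation record format (`recordId`, `result`, base64 `data`); the `review_body` field and the `stub_sentiment` function are hypothetical stand-ins, and no AWS service is called.

```python
import base64
import json


def stub_sentiment(text):
    """Placeholder for a sentiment call; a real deployment might call Amazon Comprehend."""
    return "POSITIVE" if "good" in text.lower() else "NEUTRAL"


def handler(event, context=None, detect=stub_sentiment):
    """Firehose transformation: decode each record, add a sentiment field, re-encode."""
    output = []
    for record in event["records"]:
        review = json.loads(base64.b64decode(record["data"]))
        review["sentiment"] = detect(review.get("review_body", ""))
        data = (json.dumps(review) + "\n").encode("utf-8")  # newline-delimited JSON
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(data).decode("utf-8"),
        })
    return {"records": output}


# Local usage with a hand-built event instead of a real Firehose invocation.
payload = json.dumps({"review_body": "Good value for the price"}).encode("utf-8")
sample_event = {"records": [{"recordId": "1", "data": base64.b64encode(payload).decode("utf-8")}]}
response = handler(sample_event)
print(base64.b64decode(response["records"][0]["data"]).decode("utf-8"))
```

Writing one JSON object per line keeps the delivered files easy to catalog and query later, which fits the S3, Glue, and Athena stages named on the slide.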
/suman-d
Stay Connected!
Thank You
Suman Debnath
Principal Developer Advocate, India
ml.aws
