1. © 2019, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Big Data Analytics and
Machine Learning on AWS
2.
WHAT IS BIG DATA?
3.
Big Data and the 3Vs
Variety
Velocity
Volume
4.
The Cloud Advantage
Elastic and highly scalable
No upfront capital expense
Only pay for what you use
Available on-demand
5.
BIG DATA ANALYTICS
6.
Examples of Business Outcomes and Insights
• Security threat detection
• User Behavior Analysis
• Enhanced customer experience
• Business Intelligence
• Spending optimization
• Real-time alerting
7.
Examples of Big Data Sources
Relational Databases
NoSQL Databases
Web servers
Mobile phones/Tablets
3rd party feeds
IoT
Clickstream
8.
Examples of AWS Services for Big Data Analytics
EMR
EC2
Glacier
S3
Import Export
Kinesis
Direct Connect
Machine Learning
Redshift
DynamoDB
AWS Database Migration Service
AWS Lambda
AWS IoT
AWS Data Pipeline
Amazon Kinesis Analytics
Amazon SNS
AWS Snowball
Amazon SWF
Amazon Athena
Amazon QuickSight
Amazon Aurora
AWS Glue
9.
Amazon S3—Object Storage
Security and Compliance
Three different forms of encryption; encrypts data in transit when replicating across regions; log and monitor with CloudTrail; use ML to discover and protect sensitive data with Macie
Flexible Management
Classify, report, and visualize data usage trends; objects can be tagged to see storage consumption, cost, and security; build lifecycle policies to automate tiering and retention
Durability, Availability & Scalability
Built for eleven nines of durability; data distributed across 3 physical facilities in an AWS region; can be automatically replicated to any other AWS region
Query in Place
Run analytics & ML on data lake without data movement; S3 Select can retrieve a subset of data, improving analytics performance by up to 400%
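The lifecycle policies mentioned under Flexible Management can be sketched as plain data. The example below only builds a rule as a Python dict in the shape that the S3 lifecycle configuration API accepts (for instance as the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`); it makes no AWS call. The prefix `app_logs/`, the rule ID, and the day thresholds are illustrative assumptions, not values from the slides.

```python
import json


def build_lifecycle_rule(prefix, ia_after_days, archive_after_days, expire_after_days):
    """Return one lifecycle rule as a plain dict (no AWS call is made)."""
    if not (0 < ia_after_days < archive_after_days < expire_after_days):
        raise ValueError("day thresholds must be positive and strictly increasing")
    return {
        "ID": f"tier-{prefix.strip('/') or 'all'}",  # hypothetical naming scheme
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        # Tiering: move objects to cheaper storage classes as they age.
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            {"Days": archive_after_days, "StorageClass": "GLACIER"},
        ],
        # Retention: delete objects once they are no longer needed.
        "Expiration": {"Days": expire_after_days},
    }


# Assumed thresholds: infrequent access after 30 days, archive after 90, delete after 365.
lifecycle = {"Rules": [build_lifecycle_rule("app_logs/", 30, 90, 365)]}
print(json.dumps(lifecycle, indent=2))
```

Keeping the rule as data makes it easy to review and version before it is applied to a bucket.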
10.
Amazon Redshift—Data Warehousing
Fast at scale
Columnar storage technology to improve I/O efficiency and scale query performance
Secure
Audit everything; encrypt data end-to-end; extensive certification and compliance
Open file formats
Analyze optimized data formats on the latest SSD, and all open data formats in Amazon S3
Inexpensive
As low as $1,000 per terabyte per year, 1/10th the cost of traditional data warehouse solutions; start at $0.25 per hour
11.
Amazon Redshift Spectrum
Extend the data warehouse to exabytes of data in S3 data lake
Diagram: Redshift data and the S3 data lake, both queried through the Redshift Spectrum query engine
• Exabyte-scale Redshift SQL queries against S3
• Join data across Redshift and S3
• Scale compute and storage separately
• Stable query performance and unlimited concurrency
• CSV, ORC, Grok, Avro, & Parquet data formats
• Pay only for the amount of data scanned
12.
Amazon EMR—Big Data Processing
Low cost
Flexible billing with per-second billing, EC2 Spot, Reserved Instances, and auto-scaling to reduce costs 50–80%
Easy
Launch fully managed Hadoop & Spark in minutes; no cluster setup, node provisioning, or cluster tuning
Latest versions
Updated with the latest open source frameworks within 30 days of release
Use S3 storage
Process data directly in the S3 data lake securely with high performance using the EMRFS connector
13.
Amazon Elasticsearch Service
Easy to Use
Fully managed; deploy production-ready clusters in minutes
Secure
Secure access with VPC to keep all traffic within AWS network
Open
Direct access to Elasticsearch open-source APIs; supports Logstash and Kibana
Available
Zone awareness replicates data between two AZs; automatically monitors & replaces failed nodes
14.
Amazon Kinesis—Real Time
Kinesis Data Firehose
Load data streams into AWS data stores
Kinesis Data Streams
Build custom applications that analyze data streams
Kinesis Video Streams (New)
Capture, process, and store video streams for analytics
Kinesis Data Analytics
Analyze data streams with SQL
15.
Amazon Athena—Interactive Analysis
Interactive query service to analyze data in Amazon S3 using standard SQL
No infrastructure to set up or manage and no data to load
Ability to run SQL queries on data archived in Amazon Glacier (coming soon)
Query Instantly
Zero setup cost; just point to S3 and start querying
Open
ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Easy
Serverless: zero infrastructure, zero administration; integrated with QuickSight
Pay per query
Pay only for queries run; save 30–90% on per-query costs through compression
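Because Athena bills by the amount of data a query scans, the compression saving quoted above is simple arithmetic. The sketch below works it through under stated assumptions: the price of 5.0 per terabyte scanned is an illustrative figure (check current pricing), terabytes are treated as decimal, and the 2 TB to 0.5 TB compression ratio is invented for the example.

```python
TB = 10 ** 12  # decimal terabyte; an assumption made for this sketch


def query_cost(bytes_scanned, price_per_tb=5.0):
    """Cost of one query when billing is proportional to bytes scanned."""
    return bytes_scanned / TB * price_per_tb


def savings_fraction(raw_bytes, compressed_bytes):
    """Fraction of the per-query cost saved by scanning compressed data."""
    return 1 - compressed_bytes / raw_bytes


raw = 2 * TB            # hypothetical uncompressed dataset
compressed = TB // 2    # the same dataset after 4:1 compression (assumed ratio)

print(f"uncompressed scan: ${query_cost(raw):.2f}")
print(f"compressed scan:   ${query_cost(compressed):.2f}")
print(f"saving:            {savings_fraction(raw, compressed):.0%}")
```

A 4:1 ratio gives a 75% saving, which sits inside the 30–90% range the slide quotes.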
16.
Amazon QuickSight
Easy
Empower everyone
Seamless connectivity
Fast analysis
Serverless
17.
Modern data architecture
Stages, origin to destination: Data sources, Ingest, Speed (Real-time) / Scale (Batch), Serving, Insight consumers
18.
Modern data architecture
Stages, origin to destination: Data sources, Ingest, Speed (Real-time) / Scale (Batch), Serving
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
19.
Modern data architecture
Stages, origin to destination: Data sources, Ingest, Speed (Real-time) / Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
20.
Modern data architecture
Insights to enhance business applications, new digital services
Stages: Data sources, Ingest, Speed (Real-time) / Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless (Amazon Elasticsearch Service), Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
21.
Modern data architecture
Stages: Data sources, Ingest, Speed (Real-time) / Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Ingest: AWS Database Migration Service, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
Scale (Batch): Staged Data / Data Lake (Amazon S3)
Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless (Amazon Elasticsearch Service), Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
22.
Modern data architecture
Insights to enhance business applications, new digital services
Stages: Data sources, Ingest, Speed (Real-time) / Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Ingest: AWS Database Migration Service, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
Scale (Batch): Raw Data (Amazon S3), ETL (Amazon EMR), Staged Data / Data Lake (Amazon S3), Advanced Analytics (MLlib)
Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless (Amazon Elasticsearch Service), Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
Also shown: AWS CloudTrail, AWS IAM, Amazon CloudWatch, AWS KMS
23.
Modern data architecture
Stages: Data sources, Ingest, Speed (Real-time) / Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Ingest: AWS Database Migration Service, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
Scale (Batch): Raw Data (Amazon S3), ETL (Amazon EMR), Staged Data / Data Lake (Amazon S3), Advanced Analytics (MLlib)
Speed (Real-time): Event Capture (Amazon Kinesis), Stream Analysis (Amazon EMR)
Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless (Amazon Elasticsearch Service), Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
24.
Modern data architecture
Stages: Data sources, Ingest, Speed (Real-time) / Scale (Batch), Serving
Data sources: Transactions, Web logs / cookies, ERP, Connected devices, Social media
Ingest: AWS Database Migration Service, AWS Direct Connect, Internet Interfaces, Amazon Kinesis
Scale (Batch): Raw Data (Amazon S3), ETL (Amazon EMR), Staged Data / Data Lake (Amazon S3), Advanced Analytics (MLlib)
Speed (Real-time): Event Capture (Amazon Kinesis), Stream Analysis (Amazon EMR), Event Scoring (Amazon AI), Event Handler (AWS Lambda), Response Handler (AWS Lambda)
Serving: Data Warehouse (Amazon Redshift), Legacy Apps (Amazon RDS), Schemaless (Amazon Elasticsearch Service), Direct Query (Amazon Athena), Near-Zero Latency (Amazon DynamoDB), Semi/Unstructured (Amazon EMR)
Consumers: Data analysts, Data scientists, Business users, Engagement platforms, Automation / events
25.
A Sample Batch Analytics Pipeline
Ad-hoc access to data using Athena
Athena can query aggregated datasets as well
26.
Smart Applications | Machine Learning
27.
Clickstream Analysis
28.
Customer Success.
Powered by AWS.
29.
Sysco is the leader
in selling, marketing,
and distributing food.
Challenge:
Large volumes of data in
multiple systems. Also, high costs
from maintaining on-premises
EDW deployment.
Solution:
• Migrated their on-premises
solution to the cloud with
Redshift, S3, EMR, and Athena
30.
Analytics on the Data Lake
• Sysco is the leader in selling, marketing, & distributing food
• Challenge: large volumes of data in multiple systems
• Consolidated data into a single S3 data lake
• Data scientists use EMR notebooks; business users use Athena & Amazon Redshift Spectrum for reporting
Diagram: raw data from the marketing data source and other source systems is ingested into S3; an ETL / data preparation process writes transformed data to S3, which is read by Redshift, Redshift Spectrum, Athena, and EMR
31.
FINRA oversees > 3,000
securities firms doing
business in the United States.
Challenge:
FINRA’s legacy system did not
scale well
• Up to 75 billion events per day
• Run complex surveillance queries
over 20+ PB of data
Solution:
• Migrated their big data appliance
to an S3 Data Lake and used EMR
for ingestion and processing
• Migrated to RDS and testing Aurora
32.
FINRA uses S3 to Build Data Lake with EMR
• Required fast access
across trillions of trade
records (20PB+)
• Migrated from
on-premises system
• Use Apache HBase on
Amazon EMR to store
and serve this data
• Use EMR engines—
Spark, Presto, and Hive
to process data
• Lower costs by 60% over
on-premises system
Diagram: Spark, Presto, Hive, and HBase on EMR over data in S3, with the Herd metastore
33.
Nasdaq operates financial
exchanges around the
world, and processes
large volumes of data.
Challenge:
Nasdaq wanted to make their large
historical data footprint available
to analyze as a single dataset.
Solution:
• Use Amazon Redshift for
interactive querying
• Use Amazon S3 as a Data Lake,
and Presto on EMR to process
historical data
34.
Nasdaq Uses AWS to Build a Data Lake
• Migrate legacy
on-premises warehouse
to Amazon Redshift
• 4.8B rows inserted
per trading day
(orders, trades, quotes)
• Ingests data from multiple
sources, validates it, and
stages it in S3
• Redshift reads data out of
S3 for fast queries
• Presto on EMR and S3 used
for analysis of massive
historical data set
Diagram: data from all 7 exchanges operated by Nasdaq (orders, quotes, trade executions) arrives as flat files and from operational databases, is staged in S3, and is queried through Redshift and EMR by SQL clients
35.
Data Lake Overview
36.
What is a Data Lake?
• A centralized repository for both structured and unstructured data
• Store data as-is in open-source file formats to enable direct analytics
37.
Why a Data Lake?
• Decouple storage from compute, allowing
you to scale
• Enable advanced analytics across all of
your data sources
• Reduce complexity in ETL and
operational overhead
• Future extensibility as new database and
analytics technologies are invented
38.
Traditionally, Analytics Looked Like This
OLTP ERP CRM LOB
Data Warehouse
Business Intelligence
TBs-PBs Scale
Schema Defined Prior to Data Load
Operational and Ad Hoc Reporting
Large Initial Capex + $$K / TB/ Year
Relational Data
39.
Data Lakes Extend the Traditional Approach
Sources: OLTP, ERP, CRM, LOB, Web, Sensors, Social, Devices
Data Lake with Catalog
Consumers: DW Queries, Big Data Processing, Interactive, Real-Time, Business Intelligence, Machine Learning
TB-EBs Scale
All Data in one place, a Single Source of Truth
Relational and Non-Relational Data
Decouples (low cost) Storage and Compute
Schema on Read
Diverse Analytical Engines
40.
Benefits of a Data Lake – All Data in One Place
Store and analyze all of your data,
from all of your sources, in one
centralized location.
“Why is the data distributed in
many locations? Where is the
single source of truth?”
41.
Benefits of a Data Lake – Quick Ingest
Quickly ingest data
without needing to force it into a
pre-defined schema.
“How can I collect data quickly
from various sources and store
it efficiently?”
42.
Benefits of a Data Lake – Storage vs Compute
Separating your storage and compute
allows you to scale each component as
required
“How can I scale up with the
volume of data being generated?”
43.
Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple
analytics and processing frameworks
to the same data?”
A Data Lake enables ad-hoc
analysis by applying schemas
on read, not write.
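Schema on read can be illustrated with a small sketch: the raw data is stored once, untyped, and each consumer applies its own schema when it reads. The CSV content, the column names, and the two schemas (`finance_schema`, `ops_schema`) are hypothetical and chosen only for illustration.

```python
import csv
import io
from datetime import datetime

# Raw data stored as-is; nothing about types is fixed at write time.
RAW = (
    "order_id,placed_at,amount\n"
    "1001,2019-07-16T14:00:00,19.99\n"
    "1002,2019-07-16T16:00:00,5.00\n"
)


def read_with_schema(raw_text, schema):
    """Parse raw CSV text, keeping only the schema's columns and converting each value."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{col: convert(row[col]) for col, convert in schema.items()} for row in reader]


# Two consumers read the same bytes with different schemas.
finance_schema = {"order_id": int, "amount": float}
ops_schema = {"order_id": str, "placed_at": datetime.fromisoformat}

finance_rows = read_with_schema(RAW, finance_schema)
ops_rows = read_with_schema(RAW, ops_schema)
print(finance_rows[0])
print(ops_rows[0])
```

Neither consumer had to agree on types up front, which is what lets multiple analytics frameworks share the same stored data.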
44.
Building a Data Lake on AWS
45.
Why AWS?
Implementing a Data Lake architecture requires a broad
set of tools and technologies to serve an increasingly
diverse set of applications and use cases.
46.
Data Lake on AWS
Central Storage (scalable, secure, cost-effective): Amazon S3
Data Ingestion: AWS Snowball, AWS Storage Gateway, Amazon Kinesis Data Firehose, AWS Direct Connect, AWS Database Migration Service
Catalog & Search: AWS Glue, Amazon DynamoDB, Amazon Elasticsearch Service
Access & User Interfaces: AWS AppSync, Amazon API Gateway, Amazon Cognito
Manage & Secure: AWS KMS, AWS CloudTrail, AWS IAM, Amazon CloudWatch
Analytics & Serving: Amazon Athena, Amazon EMR, AWS Glue, Amazon Redshift, Amazon DynamoDB, Amazon QuickSight, Amazon Kinesis, Amazon Elasticsearch Service, Amazon Neptune, Amazon RDS
47.
Why Amazon S3 for a Data Lake?
Durable: Designed for 11 9s of durability
Available: Designed for 99.99% availability
High performance: Multipart upload; Range GET
Scalable: Store as much as you need; scale storage and compute independently; no minimum usage commitments
Integrated: Amazon EMR; Amazon Redshift; Amazon DynamoDB
Easy to use: Simple REST API; AWS SDKs; read-after-create consistency; event notification; lifecycle policies
48.
What can you do with a Data Lake?
49.
Query Directly with Amazon Athena
50.
Analyze with Hadoop on Amazon EMR
51.
Create Visualizations with Amazon QuickSight
52.
Train ML Models with Amazon SageMaker
53.
Create a Central Data Catalog with AWS Glue
54.
Load into Downstream Services
Amazon Redshift: Run complex analytic queries against petabytes of structured data
Amazon DynamoDB: A NoSQL database service that delivers consistent, single-digit millisecond latency at any scale
Amazon Aurora: A MySQL and PostgreSQL compatible relational database built for the cloud
Amazon Elasticsearch: Delivers Elasticsearch’s real-time analytics capabilities alongside the availability, scalability, and security that production workloads require
55.
Data Movement into the Data Lake
56.
Data Sources
Files, Logs, Streams, Databases
57.
Data Sources - Databases
Databases to Amazon S3
58.
Change Data Capture
Techniques to Capture Changes
• Timestamp
• Diff Comparison
• Triggers
• Transaction Log
59.
Change Data Capture – Timestamp
Source rows at the previous run: 4/18/18 300; 3/12/18 800; 9/25/17 230; 2/04/18 100
Source rows now: 4/18/18 300; 7/16/19 1600; 9/25/17 230; 2/04/18 100
Last Run: 7/16/19 1400
Flow: changed rows, then Kinesis Data Firehose, then Amazon S3
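The timestamp technique reduces to one comparison: pick every row whose last-modified time is later than the previous run. The sketch below assumes the second number on each slide row is a 24-hour time with the leading zero dropped (so `300` is read as 03:00); that reading, the row ids, and the field name `modified` are assumptions for illustration. Delivery to Kinesis Data Firehose is left out.

```python
from datetime import datetime


def changed_since(rows, last_run):
    """Return the rows whose modification timestamp is later than the last run."""
    return [row for row in rows if row["modified"] > last_run]


# Rows as they look now, following the slide's example (ids are hypothetical).
rows = [
    {"id": 1, "modified": datetime(2018, 4, 18, 3, 0)},
    {"id": 2, "modified": datetime(2019, 7, 16, 16, 0)},  # updated since the last run
    {"id": 3, "modified": datetime(2017, 9, 25, 2, 30)},
    {"id": 4, "modified": datetime(2018, 2, 4, 1, 0)},
]
last_run = datetime(2019, 7, 16, 14, 0)

to_capture = changed_since(rows, last_run)
print([row["id"] for row in to_capture])
```

The approach depends on every write updating the timestamp column, and it cannot see deleted rows, which is one reason the other techniques exist.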
60.
Change Data Capture – Diff Compare
Snapshots: 6/15/18 0300 (20180615T0300) and 6/16/18 0300 (20180616T0300)
Flow: Diff Compare, then Kinesis Data Firehose, then Amazon S3
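A diff compare takes two full snapshots and derives the changes between them. The following is a minimal sketch, assuming each snapshot fits in memory as a dict keyed by primary key; the snapshot contents and the change-record shape (`op`, `key`, `row`) are invented for the example.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key and return a list of change records."""
    changes = []
    for key in sorted(new.keys() - old.keys()):
        changes.append({"op": "insert", "key": key, "row": new[key]})
    for key in sorted(old.keys() - new.keys()):
        changes.append({"op": "delete", "key": key, "row": old[key]})
    for key in sorted(old.keys() & new.keys()):
        if old[key] != new[key]:
            changes.append({"op": "update", "key": key, "row": new[key]})
    return changes


# Hypothetical snapshots taken at 20180615T0300 and 20180616T0300.
snapshot_0615 = {1: {"qty": 10}, 2: {"qty": 5}, 3: {"qty": 7}}
snapshot_0616 = {1: {"qty": 10}, 2: {"qty": 6}, 4: {"qty": 1}}

changes = diff_snapshots(snapshot_0615, snapshot_0616)
for change in changes:
    print(change)
```

Unlike the timestamp technique, this detects deletes, at the cost of reading both snapshots in full on every run.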
61.
Change Data Capture – Triggers
Roster table row: Id: 20982358; Name: Jean-Luc Picard; Rank: Captain; State: Agitated
ChangeData table row (written by the trigger): Table: Roster; Id: 20982358; Operation: Update
ChangeDataBatch job ag8afh8: SELECT from ChangeData (Table: Roster; Id: 20982358; Operation: Update)
Write operations to Firehose: Kinesis Data Firehose, then Amazon S3
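The trigger technique can be tried locally with Python's built-in `sqlite3` module. This sketch mirrors the slide's `Roster` and `ChangeData` tables, with two assumptions: the change-table column is named `TableName` because `Table` is a reserved word in SQL, and the initial `State` value of `Calm` is invented. The batch job that forwards captured rows to Kinesis Data Firehose is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Roster (Id INTEGER PRIMARY KEY, Name TEXT, Rank TEXT, State TEXT);
    CREATE TABLE ChangeData (TableName TEXT, Id INTEGER, Operation TEXT);

    -- Every update to Roster records which row changed and how.
    CREATE TRIGGER roster_update AFTER UPDATE ON Roster
    BEGIN
        INSERT INTO ChangeData (TableName, Id, Operation)
        VALUES ('Roster', NEW.Id, 'Update');
    END;
    """
)

conn.execute("INSERT INTO Roster VALUES (20982358, 'Jean-Luc Picard', 'Captain', 'Calm')")
conn.execute("UPDATE Roster SET State = 'Agitated' WHERE Id = 20982358")

# A batch job would SELECT these rows and write them to the delivery stream.
pending = conn.execute("SELECT TableName, Id, Operation FROM ChangeData").fetchall()
print(pending)
```

Triggers capture every change as it happens but add write overhead to the source database, which is the usual trade-off against log-based capture.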
62.
Change Data Capture – Database Logs
LOG_FILE_HDR_SIZE
OS_FILE_LOG_BLOCK_SIZE
FORMAT
CHECKSUM
LOG_CHECKPOINT_1
LOG_CHECKPOINT_2
Checkpoint_lsn
Checkpoint_no
Log.buf_size
LOG BLOCK
LOG_BLOCK_HDR_SIZE
Hdr_no
[…]
???
Tx001.log
63.
Database Migration Service
AWS Database Migration Service (AWS DMS): easily and securely migrate and/or replicate your databases and data warehouses to AWS
AWS Schema Conversion Tool (AWS SCT): convert your commercial database and data warehouse schemas to open-source engines or AWS-native services, such as Amazon Aurora and Redshift
64.
When to use DMS and SCT?
Modernize
Modernize your database tier:
• Commercial to open-source
• Commercial to Amazon Aurora
Modernize your Data Warehouse:
• Commercial to Redshift
Migrate
• Migrate business-critical applications
• Migrate from Classic to VPC
• Migrate data warehouse to Redshift
• Upgrade to a minor version
Replicate
• Create cross-region Read Replicas
• Run your analytics in the cloud
• Keep your dev/test and production environments in sync
65.
Data Sources - Files
Files to Amazon S3
66.
Files
Optimizing Transfers
• S3 Multi-Part Upload
• S3 Transfer Acceleration
• AWS Direct Connect
Available Services
• AWS DataSync
• AWS Transfer - SFTP
• AWS Snowball/Snowmobile
67.
Uploading to Amazon S3
• Amazon S3 supports both a single-part upload and a multi-part upload API
• The single-part upload supports objects up to 5 GB in size
• The multi-part upload supports objects up to 5 TB in size
• The multi-part upload also enables you to maximize your throughput by using parallel threads
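The size limits above determine how an upload has to be split. The sketch below plans the byte ranges for a multi-part upload without calling AWS; each range could then be sent by a separate thread. The 5 GB and 5 TB limits come from the slide and are treated here as binary units, which is an assumption. The 5 MiB minimum part size, 5 GiB maximum part size, and 10,000-part cap are the documented S3 limits. The default part size of 64 MiB is an arbitrary choice for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
TIB = 1024 ** 4

MAX_SINGLE_PUT = 5 * GIB  # single-part limit quoted on the slide
MAX_OBJECT = 5 * TIB      # multi-part limit quoted on the slide
MIN_PART = 5 * MIB        # S3 minimum for every part except the last
MAX_PART = 5 * GIB        # S3 maximum part size
MAX_PARTS = 10_000        # S3 maximum number of parts per upload


def needs_multipart(object_size):
    """True when the object is too large for a single-part upload."""
    return object_size > MAX_SINGLE_PUT


def plan_parts(object_size, part_size=64 * MIB):
    """Return inclusive (first_byte, last_byte) ranges, one per part."""
    if not 0 < object_size <= MAX_OBJECT:
        raise ValueError("object size must be between 1 byte and 5 TiB")
    if not MIN_PART <= part_size <= MAX_PART:
        raise ValueError("part size must be between 5 MiB and 5 GiB")
    # Double the part size until the object fits in at most MAX_PARTS parts.
    while -(-object_size // part_size) > MAX_PARTS:
        part_size *= 2
    return [
        (start, min(start + part_size, object_size) - 1)
        for start in range(0, object_size, part_size)
    ]


parts = plan_parts(200 * MIB)
print(len(parts), parts[0], parts[-1])
```

Because the ranges are independent, a thread pool can upload them in parallel and the service assembles the object once all parts are complete.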
68.
S3 Transfer Acceleration
• PUT requests go through the nearest AWS Edge Location
• Data transits over the AWS private network rather than the Internet
• AWS private network optimizes throughput and latency to the AWS Region
• Data is not stored in the edge cache
Diagram: Uploader, AWS edge location, S3 bucket
69.
AWS Direct Connect
Corporate Data Center: Customer Gateway
Direct Connect Location: Customer/Partner Cage (Customer/Partner Router), AWS Cage (AWS Direct Connect Endpoint)
AWS Region: Virtual Private Cloud (EC2), VPC Endpoint, Amazon S3
Links: Private Virtual Interface, Public Virtual Interface
70.
AWS DataSync
Online transfer service that simplifies, automates, and accelerates moving data between on-premises storage and AWS
Combines the speed and reliability of network acceleration software with the cost-effectiveness of open source tools
Fast data transfer; Cost-effective; Easy to use; Secure and reliable; Cloud integrated
71.
AWS Transfer for SFTP
Fully managed SFTP service for Amazon S3
Fully managed in AWS; Seamless migration of existing workflows; Native integration with AWS services; Simple to use; Cost-effective; Secure and Compliant
72.
AWS Snowball/Snowmobile
Use Case | AWS Solution
Cloud Migration, Disaster Recovery | AWS Snowball
Internet of Things (IoT), Remote Locations | AWS Snowball Edge
Migrating Exabytes of Data | AWS Snowmobile
73.
Data Sources - Streams
Streams to Amazon S3
74.
Streams
Collecting and Analyzing
• Amazon Kinesis
• Amazon Managed Streaming for Kafka (MSK)
• Example: Clickstream Analytics
75.
Amazon Kinesis - Stream Processing on AWS
Firehose
• Buffer records in a stream into a single output for more efficient storage
• Automatic flushing of buffer to S3, Elasticsearch, Redshift, or Splunk
Analytics
• Create time windows over streams and perform aggregate operations using SQL
• Join together multiple streams and output to new streams
Streams
• Capture streaming data for downstream processing
• Allow multiple processors to read streams at their own rate
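The idea of time windows with aggregate operations, which Kinesis Data Analytics expresses in SQL, can be shown with a plain-Python stand-in. This is not the service's API; it is a minimal sketch of a tumbling (fixed, non-overlapping) window count, with invented clickstream events given as `(epoch_seconds, page)` pairs.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per key within fixed, non-overlapping time windows.

    events: iterable of (epoch_seconds, key) pairs.
    Returns {window_start: {key: count}} ordered by window start.
    """
    windows = defaultdict(lambda: defaultdict(int))
    for timestamp, key in events:
        window_start = timestamp - timestamp % window_seconds
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


# Hypothetical clickstream: page views with timestamps in seconds.
clicks = [(0, "/home"), (12, "/cart"), (59, "/home"), (60, "/home"), (125, "/checkout")]

result = tumbling_window_counts(clicks, 60)
for start, counts in result.items():
    print(start, counts)
```

A real stream never ends, so a streaming engine emits each window's result once the window closes instead of collecting everything first as this sketch does.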
76.
Summary - Ingestion
Sources: Files, Streams, Logs, Databases
Ingestion services: API Gateway, Kinesis Agent, DMS, Kinesis Data Firehose, AWS DataSync
Destination: Amazon S3
s3://datalake/
  /vendorfeeds
    /vendorA
    /vendorB
  /clickstream
    /orders
    /vendors
    /customers
  /app_logs
    /instance1
    /instance2
  /syslogs
    /instance1
    /instance2
  /databases
    /customers
    /orders
    /vendors
77.
Consuming Data from the Data Lake
78.
Anti-Pattern
Everything
Query
79.
Also an Anti-Pattern
Everything
Query
80.
One tool to rule them all
81.
Where do I start?
• Understand your data
• Data Structure, Access patterns & characteristics,
Temperature, Cost, Size
• Know your audience
• Business Users, Data Scientists, Developers
• Select the right service
82.
Understand your Data
Temperature axis: Hot data, Warm data, Cold data
From hot to cold: latency and data volume go from low to high; request rate and cost / GB go from high to low
Vertical axis: data structure, low to high
Store types plotted: In-memory, NoSQL, Search, Warehouse, Object, Archival
83.
Understand your Data
Temperature axis: Hot data, Warm data, Cold data
From hot to cold: latency and data volume go from low to high; request rate and cost / GB go from high to low
Vertical axis: data structure, low to high
Services plotted: Amazon ElastiCache (In-Memory), Amazon DynamoDB (NoSQL), Amazon ES (Search), Amazon Redshift (Warehouse), Amazon S3 (Object), Amazon Glacier (Archival)
84.
ROLE: PRIORITIES
Data Visualizer: Creating engaging visual and narrative journeys for analytical solutions
Data Product Manager: Manages data as a product. Ensures freshness and consistency of data; understands lineage and compliance needs; treats DS as customers
DevOps Engineer: Monitoring for reliability, quickly diagnose deployment or availability issues
Data Scientist: Makes sense of data, generates and communicates insights to improve or create business processes, creates predictive ML models to support them
Data Engineer: Builds scalable pipelines, transforms and loads data into structures complete with metadata that can be readily consumed by DS
Business Sponsor: Vetting the prioritization and ROI, funding projects, providing ongoing feedback
NEEDS (in slide order): Visualization; Dashboards; Reporting; Reports – data quality, errors; Ad hoc querying; Dashboards; Ad hoc querying; Robust ML tools; Ad hoc querying; Quick visualization; Reporting; Dashboards
85.
Overview of AI/ML
86.
AI vs Machine Learning vs Deep Learning
Artificial Intelligence: Machines or programs exhibiting intelligence
Machine Learning: Learning without being explicitly programmed
Deep Learning: Learning based on Deep Neural Networks
87.
Closer Look at Machine Learning
and when do you use it
89.
43,252,003,274,489,856,000
43 QUINTILLION
UNIQUE COMBINATIONS
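The figure above matches the standard count of reachable positions of a 3×3×3 Rubik's Cube, which the move notation on the following slides suggests is the intended example. The arithmetic behind it can be checked in a few lines: 8 corners can be permuted and each twisted three ways (the last twist is forced), 12 edges can be permuted and each flipped two ways (the last flip is forced), and the corner and edge permutations must have the same parity, which halves the total.

```python
from math import factorial

# 8 corner positions, 3 orientations each, with the last orientation determined.
corner_arrangements = factorial(8) * 3 ** 7

# 12 edge positions, 2 orientations each, with the last orientation determined.
edge_arrangements = factorial(12) * 2 ** 11

# Corner and edge permutation parity must match, so only half are reachable.
total = corner_arrangements * edge_arrangements // 2

print(f"{total:,}")
```

A state space of this size is why enumerating the answers by hand is hopeless and a learned function is attractive.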
90.
F2 U' R' L F2 R L' U'
Learning
function
91.
Learning function
F2 U' R' L F2 R L' U': 1% accuracy
R U r U R U2 r U: 2% accuracy
92.
Learning function
Accuracy as training progresses: 2%, 20%, 40%, 60%, 80%, 95%
93.
Learning
function
95%
accuracy
?
F2 R F R′ B′ D F D′ B D F
95.
Don’t code the patterns; let
the system learn through data
96.
Train a model
positive/negative
reinforcement
Infer from a model
to obtain a
prediction
Data
Feedback
Model
98.
Supervised Learning – How Machines Learn
Human intervention and validation required
e.g. Photo classification and tagging
Training Data: Input plus Label (Dog, Cat)
Machine Learning Algorithm: produces a Prediction for each Input
Prediction is compared with the Label (Dog): Adjust Model
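The predict, compare, adjust cycle can be sketched with a tiny perceptron. This is a stand-in chosen for brevity and not the algorithm any particular AWS service uses; the two numeric features per example and all training values are invented, standing in for whatever a real photo classifier would extract from images.

```python
def score(weights, bias, features):
    """Weighted sum of the features; positive means 'Dog' in this sketch."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict(model, features):
    weights, bias = model
    return "Dog" if score(weights, bias, features) > 0 else "Cat"


def train(examples, epochs=10, learning_rate=0.1):
    """Adjust the model whenever its prediction disagrees with the label."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for features, label in examples:
            target = 1 if label == "Dog" else 0
            guess = 1 if score(weights, bias, features) > 0 else 0
            error = target - guess  # 0 when the prediction matches the label
            weights = [w + learning_rate * error * x for w, x in zip(weights, features)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical labeled training data: (feature vector, label).
training_data = [
    ((1.0, 0.2), "Dog"), ((0.2, 0.9), "Cat"),
    ((0.9, 0.3), "Dog"), ((0.1, 0.8), "Cat"),
    ((0.8, 0.1), "Dog"), ((0.3, 1.0), "Cat"),
]

model = train(training_data)
print(predict(model, (0.85, 0.25)))  # an unseen input
```

The labels supply the feedback signal, which is why supervised learning needs human-validated training data.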
100.
Machine Learning at Amazon.com
Retail: Demand Forecasting; Vendor Lead Time Prediction; Pricing; Packaging; Substitute Prediction
Customers: Recommendation; Product Search; Product Ads; Shopping Advice; Customer Problem Detection
Catalogue: Browse-Node Classification; Meta-data Validation; Review Analysis; Product Matching
Text: In-Book Search; Named-entity Extraction; Summarisation/Xray; Plagiarism Detection
Seller: Fraud Detection; Predictive Help; Seller Search & Crawling
Images: Visual Search; Product Image Enhancement; Brand Tracking
102.
Alexa, Hello!
108.
Our Mission at AWS
Put AI and ML in the hands of every developer and data scientist
109.
AI SERVICES
Vision: Rekognition Image, Rekognition Video, Textract
Speech: Polly, Transcribe
Language: Translate, Comprehend
Chatbots: Lex
Forecasting: Forecast
Recommendations: Personalize
ML SERVICES
Amazon SageMaker: Build, Train, Deploy
Pre-built algorithms & notebooks; Data labeling (Ground Truth); One-click model training & tuning; Optimization (Neo); One-click deployment & hosting; Reinforcement learning; Algorithms & models (AWS Marketplace for Machine Learning)
ML FRAMEWORKS & INFRASTRUCTURE
Frameworks; Interfaces; Infrastructure: EC2 P3 & P3N, EC2 C5, FPGAs, Greengrass, Elastic Inference
111.
Vision: Amazon Rekognition
Key Features
Object & Scene Detection
Image Moderation
Facial Analysis
Facial Comparison
Facial Recognition
Celebrity Recognition
113.
Vision: Amazon Rekognition Video
Video Analysis
Object and Activity Detection; Person Tracking; Face Recognition; Real-time Live Stream; Content Moderation; Celebrity Recognition
114.
Speech: Amazon Polly
Key Features
• 50 Voices
• 24 Languages
• Lip-Syncing & Text Highlighting
• Fine-grained Voice Control
• Custom Vocabularies
115.
Language: Amazon Lex
Conversational interfaces for your applications, powered by the same Natural
Language Understanding (NLU) & Automatic Speech Recognition (ASR) models
as Alexa
118.
ML can be very complicated
119.
Amazon SageMaker: build, train, and deploy ML at Scale
125.
How do you make it easier to obtain high quality labeled data?
126.
Amazon SageMaker: Build, train, and deploy ML
127.
Successful models require high-quality data
128.
Build highly accurate training datasets and reduce data labeling costs by up to 70% using machine learning
129.
Amazon SageMaker Ground Truth
Label machine learning training data easily and accurately
130.
Thank You