AWS Partners: Data Analytics on AWS – Technical
Parvesh Chopra choprapa@amazon.com
Module 1: Course
Introduction
Course objectives
In this course, you will learn how to:
• Identify Amazon Web Services (AWS) services in the AWS analytics stack
• Describe decision points and technology selections for data analytics architectures
• Design highly available and fault-tolerant serverless data analytics architectures
• Discuss the AWS Data Pipeline and the customer data analytics journey using the Data
Flywheel
• Describe five AWS data analytics technical solutions:
• Modernizing a data warehouse with Amazon Redshift
• Data lakes
• Streaming data
• Data governance
• Machine learning (ML)
• Identify technical engagement strategies and best practices for delivering a proof of
concept (POC)
• Locate and use AWS Partner Network (APN) Partner resources for opportunities and training
© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved.
About this course
• This course is for technical professionals at APN Consulting Partner
organizations who are engaged in pre-sales discussions with customers to
help architect data analytic solutions on AWS and answer technical questions
about using AWS data analytics services.
• This 1-day course is focused on educating technical professionals with
sufficient technical knowledge on AWS data analytics services and solutions to
successfully engage with and help customers.
• This course is not designed to be a technical deep dive into AWS data
analytics services and solutions. It provides the necessary resources and
learning path towards gaining deeper knowledge into the services.
Agenda
Module 1: Course Introduction
Module 2: AWS Data Analytics Stack
Portfolio
Break
Module 3: AWS Data Analytics Solutions
– Part I
- Data lake solution
Break
Module 4: AWS Data Analytics Solutions
– Part II
Break
Module 5: Technical Engagement
Strategies
Module 6: APN Partner Opportunities
and Resources
Module 2: AWS Data Analytics
Portfolio
Objectives
In this module, you will learn how to:
• Understand customer challenges related to data analytics in their business
• Provide a technical overview of AWS data analytics portfolio
• Discuss technical advantages and position of data analytics solutions on
AWS
• Explain how to build a data analytics pipeline
• Explain the Data Flywheel
Customer challenges and
opportunities for APN
Partners
New realities
By making 10% more data accessible, a typical Fortune 1000
company will see a $65 million increase in net income.*
Explosion of data-
connected devices, apps,
and systems generate
more data than ever
before.
Pay-as-you-go pricing
allows organizations to
analyze data to gain
insights.
*Source: Forbes Online; New Vantage Partners - Big Data Executive Survey
https://www.forbes.com/sites/cognitiveworld/2019/02/06/data-the-fuel-powering-ai-digital-transformation/#5062b36b578b
Demand growing for faster
decision making on
real-time data.
Customers need your help
85% of businesses want to be data driven,
but only 37% have been successful.
https://www.forbes.com/sites/cognitiveworld/2019/02/06/data-the-fuel-powering-ai-digital-transformation/#51efb027578b
http://newvantage.com/wp-content/uploads/2017/01/Big-Data-Executive-Survey-2017-Executive-Summary.pdf
Common data analytics challenges
Top four challenges
involve knowledge, skill,
security, and privacy
This is your opportunity
Data security (unauthorized access to company
data)
Data privacy issues (safety of personal data)
What challenges do you see when using big data
analytics/technologies? (n=545)
Inadequate technical know-how in our company
53%
49%
48%
48%
Inadequate analytical know-how in our company
https://bi-survey.com/challenges-big-data-analytics
AWS data analytics portfolio
overview
Secure infrastructure for analytics
Customers need multiple levels of security, identity and access
management, encryption, and compliance to secure their data
lake.
Compliance
AWS Artifact
Amazon Inspector
AWS CloudHSM
Amazon Cognito
AWS CloudTrail
Security
Amazon GuardDuty
AWS Shield
AWS Well-Architected Tool
Amazon Macie
Amazon Virtual Private
Cloud (Amazon VPC)
Encryption
AWS Certificate Manager Private
Certificate Authority (ACM Private CA)
AWS Key Management Service (AWS
KMS)
Encryption at rest
Encryption in transit
Bring your own keys,
hardware security module (HSM)
support
Identity
AWS Identity and Access
Management (IAM)
AWS Single Sign-On
Amazon Cloud Directory
AWS Directory Service
AWS Organizations
AWS data analytics portfolio
AWS Database Migration Service (AWS DMS) | AWS Snowball | AWS Snowmobile | Amazon Kinesis Data
Firehose
Amazon Kinesis Data Streams | Amazon Managed Streaming for Apache Kafka
Data movement
Amazon
QuickSight
Amazon
SageMaker
Amazon
Comprehend
Amazon
Lex
Amazon
Polly
Amazon
Rekognition
Amazon
Translate
Amazon
Pinpoint
AWS Data
Exchange
Data visualization, engagement, and machine learning
Amazon
Redshift
Amazon EMR
(Spark and Presto)
Amazon
Athena
Amazon
Elasticsearch
Service
Amazon Kinesis
Data Analytics
AWS Glue
(Spark and Python)
Analytics
Amazon Simple Storage Service (Amazon
S3) & Amazon S3 Glacier
AWS
Glue
AWS Lake Formation
Data lake infrastructure and management
Data movement services
Help customers move data from on premises to the cloud
AWS DMS AWS Snowball AWS
Snowmobile
Amazon
Managed
Streaming for
Kafka
Amazon Kinesis
Data Streams
Amazon Kinesis
Data Firehose
Data lake services
Customers are constrained by volume, variety, veracity, and
velocity of on-premises data, and data silos pose a major challenge.
Amazon S3 Amazon S3 Glacier AWS Lake
Formation
AWS Glue
Analytics services
Help customers extract value out of their data
Amazon Redshift Amazon EMR AWS Glue
Amazon ES
Amazon
Athena
Amazon Kinesis
Data Analytics
Data visualization, engagement, and
machine learning services
Help customers understand and visualize their data, and use
machine learning (ML) for advanced analytics and predictions
Amazon
QuickSight
Amazon SageMaker
AWS Data
Exchange
AWS value proposition
Standards, formats, and open source
• Apache Flink
• Ganglia
• Apache HBase
• HCatalog
• Hadoop Distributed
File System (HDFS)
• Apache Hive
• Hudi
• Java
• JupyterHub
• Apache Kafka
• Apache Livy
• Apache Mahout
• MapReduce
• Apache MXNet
• MySQL
• Apache Oozie
• Apache ORC
• Apache Parquet
• Phoenix
• Apache Pig
• Presto
• Python
• PyTorch
• R
• Scala
• Apache Spark
• Sqoop
• SQL
• TensorFlow
• Tez
• Yarn
• Apache Zeppelin
• Apache Zookeeper
…and many more
AWS alternatives to open source
Amazon EMR Amazon ES
Managed Streaming
for Apache Kafka
Real-time
analytics
Kafka
Operational
analytics
Elasticsearch
Logstash
Kibana
Spark, Hive, Presto,
Flink, HBase
Hadoop
Spark
Data analytics pipeline
Data management challenges
How can customers:
• Collect a variety of data types accumulating at varying velocities?
• Collect data from numerous sources accumulating at differing velocities?
• Store massive amounts of data without running out of space?
• Cleanse data and improve its quality so that it can be analyzed?
Can they automate these steps?
Data analytics pipeline
Collect
Store
Process and
analyze
Visualize
Insights
Time-to-answer (latency)
Balance of throughput and cost
Data Insights
https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf?did=wp_card&trk=wp_card
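The four pipeline stages (collect, store, process and analyze, visualize) can be pictured as a chain of small functions. The following is a minimal sketch in plain Python, not an AWS API: the function names, the in-memory list that stands in for durable storage, and the sample click events are all assumptions made for illustration.

```python
from collections import Counter

def collect(raw_events):
    """Collect: accept events from a source and drop empty records."""
    return [event for event in raw_events if event]

def store(events, storage):
    """Store: append events to a store (a plain list stands in here)."""
    storage.extend(events)
    return storage

def process(storage):
    """Process and analyze: count events per page."""
    return Counter(event["page"] for event in storage)

def visualize(counts):
    """Visualize: render a text bar chart, most visited page first."""
    return [f"{page:<8} {'#' * n}" for page, n in counts.most_common()]

# Hypothetical sample data; one record is empty and is dropped at collection.
raw = [{"page": "home"}, {}, {"page": "cart"}, {"page": "home"}]
storage = []
report = visualize(process(store(collect(raw), storage)))
for line in report:
    print(line)
```

Each stage only depends on the output of the previous one, which is the property that lets a managed service be swapped in per stage without redesigning the whole pipeline.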
Data pipeline challenges
Building a data pipeline is challenging. Customers must:
• Manage updates, patches, and software integrations
• Handle increased overhead costs plus need for support
• Maintain focus on the core task of building applications that lead to data
insights
AWS data analytics pipeline services
Collect Store Process and analyze Visualize
Automate
Amazon
Kinesis Data
Firehose
AWS Direct
Connect
Amazon
Kinesis Data
Streams
AWS
Snowball
Amazon
S3 Glacier
Amazon S3
Amazon DynamoDB Amazon RDS
Amazon Aurora
Amazon
CloudSearch
Amazon ES
Amazon EMR
Amazon Kinesis
Data Analytics
Amazon
QuickSight
Amazon Redshift
Amazon
Athena
AWS Database
Migration Service
Amazon
SageMaker
AWS Glue
Amazon Managed
Streaming for
Kafka
Data Flywheel
Data Flywheel and customer journey
Build data-driven
applications
Modernize data
warehouse and
build a data
lake
Migrate data and
workloads to the cloud
• Save time
• Save costs
Store and
manage data
• Agility
• Global distribution
• Scale and performance
• New and faster insights
• Broader access to analytics
Innovate with
machine
learning
• Better experiences
• Deeper engagement
• Efficient processes
Attract new customers
Generate more data
Data
https://pages.awscloud.com/data-flywheel.html
Summary
In this module, you learned about:
• Customer challenges related to data analytics
• AWS data analytics portfolio
• Technical benefits of AWS data analytics solutions
• Data analytics pipeline
• Data Flywheel
Module 3: Data Analytics
Solutions on AWS – Part I
Objectives
In this module, you will learn how to:
• Explain data migration options from on premises to the AWS Cloud
• Describe two AWS data analytics technical solutions
• Modernizing a data warehouse with Amazon Redshift
• Data lakes
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Data migration options
Journey to a modern data
architecture
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Data
warehouse
modernization
Types of data
Data
governance
Machine
learning
Real-time
analytics with
streaming data
AWS data migration options
AWS Snowball
AWS Storage
Gateway
Amazon S3 Transfer
Acceleration
AWS Direct
Connect
AWS Database
Migration Service
Amazon Kinesis
Data Firehose
• File gateway
• Tape gateway
• Volume gateway
• Snowball Edge storage
optimized
• AWS Snowmobile
Solution 1: Modernizing a
data warehouse with Amazon
Redshift
Journey to a modern data
architecture
Evolution of data architecture
Traditional
data
warehousing
Data lakes
on AWS
Data
warehouse
modernization
Types of data
Data
governance
Machine
learning
Real-time
analytics with
streaming data
Data warehouses
Traditional architecture and on-premises
data warehouse challenges
• Difficult to scale
• Long lead times for hardware procurement
• Complex upgrades are the norm
• High overhead costs for administration
• Expensive licensing and support costs
• Proprietary formats do not support newer open data formats, which results in data silos
• Data not cataloged, unreliable quality
• Licensing cost limits number of users and how much data can be accommodated
• Difficult to integrate with services and tools
Amazon Redshift
Amazon Redshift
A fully managed data warehouse that is highly integrated
with other AWS services. Features include:
• Optimized for high performance
• Support for open file formats
• Petabyte-scale capability
• Support for complex queries and analytics, with data
visualization tools
• Secure end-to-end encryption and certified compliance
• Service Level Agreement (SLA) of 99.9 percent
• Based on open source Postgres database
• Cost efficient
https://aws.amazon.com/redshift/pricing/
Amazon
Redshift
Secure data warehouse that extends seamlessly to a data
lake
Amazon Redshift performance
features
Massively parallel processing (MPP)
Breaks a large job into smaller tasks, then distributes the tasks to multiple compute nodes
Result: Faster processing time
Columnar storage
Data from each column is stored together so the data can be accessed faster, without scanning and sorting all other columns
Result: Compression of stored data improves performance
Shared-nothing architecture
Independent and resilient nodes without any dependencies
Result: Improves scalability
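To make the columnar storage idea concrete, here is a minimal sketch in plain Python. It is not how Amazon Redshift is implemented: the sample rows, the helper names, and the use of run-length encoding as a stand-in for column compression are assumptions chosen only to show why storing each column together helps both scanning and compression.

```python
# Hypothetical row-oriented records.
rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
]

def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {name: [r[name] for r in records] for name in records[0]}

def run_length_encode(values):
    """Compress a column by collapsing runs of repeated values."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
total_amount = sum(columns["amount"])

# Values of the same type and domain sit together, so they compress well.
region_rle = run_length_encode(sorted(columns["region"]))
```

The sum reads three numbers instead of three full records, and the sorted region column collapses into two runs, which is the intuition behind the "compression of stored data improves performance" result on the slide.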
Amazon Redshift architecture
Client
applications
Leader node
Compute Node 1 Compute Node 2
Data warehouse cluster
Java Database
Connectivity
(JDBC)
Open Database
Connectivity
(ODBC)
https://docs.aws.amazon.com/redshift/index.html
Node slices Node slices
Leader node
Responsible for communication with the client application
and compute nodes
Amazon Redshift leader
node:
• SQL endpoint
• Metadata
• Query compilation and
optimization
• Coordinates parallel
SQL processing
• Machine learning (ML)
optimizations
Leader node
Compute node 1 Compute node 2
Data warehouse cluster
Node slices Node slices
Compute node
• SQL running powerhouses
• Compute node can load, unload, backup,
and restore data to and from Amazon S3.
• Clusters range from 1 to 128 compute nodes.
Runs queries in parallel and returns the result to the leader node
Leader node
Compute node 1 Compute node 2
Data warehouse cluster
Node slices Node slices
Compute node slices
Slices are a symmetric multiprocessing (SMP) mechanism.
Slice 1 | Slice 2
Local
disk
Local
disk
Virtual
core
Virtual
core
7.5 GB
RAM
7.5 GB
RAM
• Partitioned into slices.
• Slices work in parallel to
complete operations.
• Virtual processors contained
in each compute node.
• Each slice is allocated an
equal amount of memory,
compute allowance, and disk
space.
• Each slice operates in
parallel but can request data
from other slices.
Compute node 1 Compute node 2
Data warehouse cluster
Node slices Node slices
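The division of work between slices and the leader node can be sketched as follows. This is a conceptual illustration in plain Python, not Amazon Redshift code: the slice count, the CRC-based distribution function, the sample rows, and the use of threads to stand in for slices are all assumptions.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

NUM_SLICES = 4  # assumed value for illustration

def slice_for(key, num_slices=NUM_SLICES):
    """Deterministically map a distribution key to a slice."""
    return zlib.crc32(str(key).encode()) % num_slices

def distribute(rows, key):
    """Place each row on the slice chosen by its distribution key."""
    slices = [[] for _ in range(NUM_SLICES)]
    for row in rows:
        slices[slice_for(row[key])].append(row)
    return slices

def partial_sum(slice_rows):
    """Work that each slice performs on its own portion of the data."""
    return sum(row["amount"] for row in slice_rows)

# Hypothetical table: 100 rows with amounts 10, 20, ..., 1000.
rows = [{"customer_id": i, "amount": i * 10} for i in range(1, 101)]
slices = distribute(rows, "customer_id")

# Each slice computes its partial result independently of the others.
with ThreadPoolExecutor(max_workers=NUM_SLICES) as pool:
    partials = list(pool.map(partial_sum, slices))

# The leader node combines the partial results into the final answer.
total = sum(partials)
```

Because every row lands on exactly one slice, the partial sums can be combined without any coordination between slices, which is what lets the slices work in parallel.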
Amazon Redshift instance types
https://docs.aws.amazon.com/redshift/latest/gsg/getting-
started.html
Management interfaces
https://us-west-2.console.aws.amazon.com/redshiftv2/home?region=us-west-
2#query-editor
Amazon Redshift
differentiating features
Federated query
Amazon Redshift
lake house architecture
Federated query
Data
warehouse
Amazon
Aurora
OLTP
ERP CRM LOB
Integrate queries on live data in Amazon RDS
for PostgreSQL and Amazon Aurora
PostgreSQL with queries on Amazon Redshift
and Amazon data lake
Reduce data moved over the network with
Amazon Redshift’s intelligent optimizer.
Pushes and distributes portions of
computation directly into remote operational
databases
Benefits
• Incorporate live data into business
intelligence (BI) and reporting applications
• Ingest data into Amazon Redshift
• Query operational databases directly
• Apply transformations on the fly
• Load data into target tables without
complex ETL pipelines
Amazon Redshift
lake house architecture
With Amazon Redshift lake house
architecture, customers can:
• Query data in the data lake and
write data back in open formats
• Use familiar SQL statements to
combine and process data across
data stores
• Run queries on live data in
operational databases without
requiring data loading and ETL
pipelines
Amazon Redshift lake house queries are run by a fleet of nodes that
are owned and maintained by AWS.
https://aws.amazon.com/redshift/lake-house-architecture/
SQL clients, business intelligence tools
Leader node
Compute node 1
Node slices
JDBC/ODBC
Compute node 2
Node slices
Amazon S3 AWS Glue Data
Catalog
Amazon Redshift
lake house
Amazon Redshift
lake house fleet
1
SELECT COUNT(*)
FROM S3.EXT_TABLE
GROUP BY…
Query
2
Query is optimized and compiled
using ML at the leader node.
Determine what is run locally and
what goes to Amazon
Redshift lake house.
3 Query plan sent
to all compute
nodes.
4 Compute nodes obtain partition
information from the Data
Catalog and dynamically
prune partitions.
5 Each compute node issues
multiple requests to Amazon
Redshift lake house layers.
6 Amazon Redshift lake house
nodes scan Amazon S3 data.
7 Amazon Redshift lake house
projects, filters, joins, and
aggregates.
8 Final aggregations and join
with local Amazon Redshift
tables done in-cluster.
9 Result is sent to client.
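Steps 4 through 8 above (partition pruning, scanning in the external layer, and final aggregation in the cluster) can be sketched as a small simulation. This is plain Python under stated assumptions, not Amazon Redshift internals: the bucket name, the month-based partitioning, the sample rows, and all function names are hypothetical.

```python
# Hypothetical catalog: partition value -> Amazon S3 prefix.
catalog = {
    "2020-01": "s3://example-bucket/sales/month=2020-01/",
    "2020-02": "s3://example-bucket/sales/month=2020-02/",
    "2020-03": "s3://example-bucket/sales/month=2020-03/",
}

# Hypothetical contents of each prefix as (region, amount) rows.
data = {
    "s3://example-bucket/sales/month=2020-01/": [("EU", 10), ("US", 20)],
    "s3://example-bucket/sales/month=2020-02/": [("EU", 5), ("EU", 7)],
    "s3://example-bucket/sales/month=2020-03/": [("US", 1), ("EU", 2), ("US", 3)],
}

def prune(partitions, wanted_months):
    """Step 4: keep only the partitions that the query filter can match."""
    return [prefix for month, prefix in partitions.items() if month in wanted_months]

def scan_and_aggregate(prefix):
    """Steps 6 and 7: scan one partition and return a partial count per region."""
    counts = {}
    for region, _amount in data[prefix]:
        counts[region] = counts.get(region, 0) + 1
    return counts

def final_aggregate(partials):
    """Step 8: merge the partial counts inside the cluster."""
    merged = {}
    for partial in partials:
        for region, n in partial.items():
            merged[region] = merged.get(region, 0) + n
    return merged

# Equivalent in spirit to a COUNT(*) ... GROUP BY region over two months.
prefixes = prune(catalog, {"2020-02", "2020-03"})
result = final_aggregate(scan_and_aggregate(p) for p in prefixes)
```

The January prefix is never scanned, and only small per-region counts travel back to the cluster, which illustrates why pruning and pushing aggregation down reduce the data moved.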
Advanced Query Accelerator
(AQUA)
A new distributed and hardware-accelerated cache that makes Amazon Redshift
faster than other cloud data warehouses, without increasing cost
Minimizes data movement over the
network
by pushing operations to Advanced
Query Accelerator (AQUA) nodes
AQUA nodes with custom AWS designed
analytics processors to make operations
(compression, encryption, filtering, and
aggregations) faster than traditional
CPUs
RA3
cluster
AQUA node
Custom
AWS
designed
processor
Running in parallel
Amazon Redshift managed
storage
RA3
cluster
RA3
cluster
AQUA node
Custom
AWS
designed
processor
AQUA node
Custom
AWS
designed
processor
AQUA node
Custom
AWS
designed
processor
Migration to Amazon Redshift
Migration pattern
Migration from a legacy OLAP system
Workload Qualification Framework (WQF) uses the AWS Schema Conversion Tool (AWS SCT) to
generate reports, such as:
• Workload assessment based on complexity, size of migration effort, and technologies
• Recommendations on migration strategies
• Step-by-step instructions for migration
• Assessment of migration effort based on team size and member roles
AWS SCT data extractors
Extract data from your data warehouse and migrate to Amazon Redshift
• Extracts data through local migration agents
• Data is optimized for Amazon Redshift and saved in local files
• Files are loaded to an Amazon S3 bucket (through network or AWS Snowball Edge)
and then to Amazon Redshift
Amazon
Redshift
AWS SCT Amazon
S3 bucket
Source DW
NETEZZA
Microsoft SQL
Server
Equinox sees faster
reports, 80% cost savings
Challenge
Their data warehouse had limited integration, was very expensive,
and required a lot of platform-specific domain knowledge. They
needed to reduce administration and costs, blend structured and
semi-structured data for analytics, and evolve into a data lake
strategy.
Solution
Equinox migrated from a legacy data warehouse to Amazon Redshift to
combine data from disparate sources like clickstream data, cycling log
data, club management software, and more. They land data directly
in an Amazon S3 data lake and perform analytics using Amazon
Redshift, Amazon Redshift Spectrum, and Amazon EMR.
Result
Their monthly Amazon Redshift bill is now 20% of prior yearly
maintenance of their legacy data warehouse. AWS data lake and
analytics reduced report delivery time from months to days.
Amazon Redshift Amazon S3 Amazon EMR
Use case: Equinox (continued)
Clickstream
Cycling logs
Club
management
software
Applications
Social
Equinox
applications
Third-party
applications
Maximilian (ELT
scripts)
Spark on
Amazon
EMR
• Migrated from Teradata data
warehouse
• Built a data warehouse with
Amazon Redshift and data lake with
Amazon S3
• Analytics on data lake with Amazon
Athena, Amazon Redshift Spectrum,
and Amazon EMR
• Increased user productivity to
move faster
• Amazon Redshift costs
approximately 20% of original
Teradata maintenance and support
• Report time reduced from months
to days
Amazon
Redshift
Amazon
Athena
Amazon EMR
Amazon
Redshift
Amazon S3
Solution 2: Data lakes
Journey to a modern data
architecture
Evolution of data architecture
Traditional
data
warehousing
Data lakes
on AWS
Data
warehouse
modernization
Types of data
Data
governance
Machine
learning
Real-time
analytics with
streaming data
Data lakes defined
• Stores all structured, semi-structured,
unstructured, and binary data at unlimited
scale
• Holds curated and raw data
• Uses AWS data analytics tools for analytics
• Increases pace of innovation by extracting
insights from data
• Enables more organizational agility
• Reduces cost and delivers results with
predictive analytics and ML
Architectural approach for a centralized
enterprise data repository stored on
Amazon S3
Machine
learning
Business
intelligence
and
analytics
Data
warehousing
Data lake
Open formats
central catalog
Secure data lake on Amazon S3
Amazon S3 Access Points
• Multi-tenant bucket
• Dedicated access points
• Customer permissions from an Amazon Virtual Private Cloud (Amazon VPC)
Amazon S3 Block Public Access
• Across AWS accounts and Amazon S3 bucket level
• Specify public permissions using Access Control List (ACL) or policy
• Four settings:
• BlockPublicAcls
• IgnorePublicAcls
• BlockPublicPolicy
• RestrictPublicBuckets
Amazon S3 object tags
• Access control, lifecycle policies, and analysis
• Classify data with metadata
• Use tags to filter objects
• Define replication policies
• Populate tags with AWS Lambda functions or S3 Batch Operations
Amazon S3 object lock
• Immutable Amazon S3 objects
• Retention management controls
• Data protection and compliance
Amazon FSx for Lustre
https://aws.amazon.com/compliance/services-in-scope
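The four Block Public Access settings named on this slide can be expressed as one configuration object. The sketch below assumes the boto3 S3 client call `put_public_access_block`; the helper names and the bucket name passed to it are hypothetical, and the function that talks to AWS is defined but not called, so nothing here needs credentials to load.

```python
def public_access_block_config(block_all=True):
    """Build the four Block Public Access settings as one configuration."""
    return {
        "BlockPublicAcls": block_all,
        "IgnorePublicAcls": block_all,
        "BlockPublicPolicy": block_all,
        "RestrictPublicBuckets": block_all,
    }

def apply_to_bucket(bucket_name):
    """Apply the settings to one bucket. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    s3 = boto3.client("s3")
    s3.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration=public_access_block_config(),
    )

# Build the configuration only; no AWS call is made here.
config = public_access_block_config()
```

Turning on all four settings is a common starting point for a data lake bucket; a customer with a deliberate public-data use case would relax individual settings rather than all of them.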
IAM
Amazon CloudWatch AWS STS AWS CloudTrail
AWS KMS
Protect and secure
Machine
learning
Amazon QuickSight Amazon EMR
Amazon
Redshift
Amazon
Athena
Processing and analytics
Amazon
Kinesis
AWS
Direct Connect AWS Snowball
AWS DMS
AWS Data
Exchange
Data ingestion
AWS Glue Amazon ES
Amazon DynamoDB
Catalog and search
Amazon API Gateway IAM Amazon Cognito
Access and user interface
Amazon S3
Central storage
Reference architecture:
Data lake on AWS
Data services – AWS Glue
Cleansing data
After migration, data still presents challenges:
Data is increasingly diverse
• Volume
• Variety
• Velocity
• Veracity
It accumulates rapidly
• Missing or incorrect
data
• Wrong data format
• Partial missing data
Avoid unsearchable data
It must be cleansed before it is
analyzed by many applications
How can customers provide access to users to gain insights?
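The data problems listed on this slide (missing or incorrect data, wrong data format, partially missing data) can be handled with a small cleansing step before analysis. The following is a minimal sketch in plain Python, not an AWS Glue job: the record layout, the accepted date formats, and the rule of discarding records that cannot be repaired are assumptions.

```python
from datetime import datetime

# Assumed input date formats; output is normalized to ISO 8601.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")

def cleanse(record):
    """Return a cleaned copy of the record, or None if it cannot be repaired."""
    if not record.get("id"):
        return None  # missing identifier
    try:
        amount = float(record.get("amount", ""))
    except (TypeError, ValueError):
        return None  # missing or non-numeric amount
    parsed = None
    for fmt in DATE_FORMATS:
        try:
            parsed = datetime.strptime(record.get("date", ""), fmt)
            break
        except (TypeError, ValueError):
            continue
    if parsed is None:
        return None  # date missing or in an unknown format
    return {"id": record["id"], "amount": amount, "date": parsed.strftime("%Y-%m-%d")}

# Hypothetical raw records with typical defects.
raw = [
    {"id": "a1", "amount": "19.99", "date": "2020-05-29"},
    {"id": "a2", "amount": "7", "date": "29/05/2020"},
    {"id": "", "amount": "3", "date": "2020-05-29"},
    {"id": "a4", "amount": "n/a", "date": "2020-05-29"},
]
clean = [c for c in map(cleanse, raw) if c is not None]
```

Two of the four records survive, both with a numeric amount and a date in one consistent format, so downstream queries do not need per-record special cases.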
AWS Glue
AWS Glue Data Catalog
• Hive metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrates with Amazon Athena, Amazon EMR, and many more
Job authoring
• Generates ETL code
• Builds on open frameworks: Python, Scala, and Apache Spark
• Developer-centric: editing, debugging, sharing
Job running
• Run jobs on a serverless Spark platform
• Use flexible scheduling, job monitoring, and alerting
Job workflow
• Orchestrate triggers, crawlers, and jobs
• Author and monitor entire flows and integrated alerting
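The idea of a crawler that extracts metadata and creates a table definition can be sketched in a few lines. This is a conceptual illustration in plain Python, not the AWS Glue crawler or its classifiers: the type names, the widening rules, and the sample JSON lines are assumptions.

```python
import json

NUMERIC = {"bigint", "double"}

def infer_type(value):
    """Map one JSON value to an assumed column type name."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"

def merge_types(old, new):
    """Widen the column type when records disagree."""
    if old == new:
        return old
    if old in NUMERIC and new in NUMERIC:
        return "double"
    return "string"

def crawl(json_lines):
    """Infer a table schema (column -> type) from JSON-lines text."""
    schema = {}
    for line in json_lines.splitlines():
        if not line.strip():
            continue
        for column, value in json.loads(line).items():
            inferred = infer_type(value)
            schema[column] = merge_types(schema.get(column, inferred), inferred)
    return schema

# Hypothetical sample: the second record stores the fare as an integer.
sample = (
    '{"trip_id": 1, "fare": 12.5, "vendor": "A"}\n'
    '{"trip_id": 2, "fare": 8, "vendor": "B"}'
)
table = crawl(sample)
```

The resulting column-to-type mapping is the kind of metadata a catalog stores so that query engines can read the files without a hand-written table definition.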
AWS Glue crawlers
Amazon Redshift
Amazon DynamoDB
Amazon S3
Databases
AWS IAM role
AWS Glue crawler
JDBC
connection
NoSQL
connection
Object
connection
Built-in
classifiers
MySQL
MariaDB
PostgreSQL
Amazon Aurora
Oracle
Amazon Redshift
Apache Avro
Parquet
ORC
XML
JSON and JSONPaths
AWS CloudTrail
Binary JSON (BSON)
Logs
Delimited
… growing
AWS Glue Data Catalog services
AWS Glue Data
Catalog
Amazon
Redshift lake
house
Amazon
Athena
AWS Glue ETL
Amazon EMR
Use case: Log aggregation with ETL
AWS service logs
Web application logs
Server logs
Amazon S3
bucket
AWS Glue
crawler
Update table partition
Create partition
on Amazon S3
Query data
AWS Glue ETL
Amazon S3
bucket
AWS Glue Data
Catalog
Amazon
Athena
Data services – AWS Data
Exchange and Amazon Athena
AWS Data Exchange
Find diverse data in one
place
Analyze data Access third-party data
Find and subscribe to third-party data in the cloud
• More than 1,000 data products
• More than 80 data providers
• Download a copy of data to
Amazon S3
• Combine, analyze, and model
with existing data
• Streamlined access to data
• Minimize legal reviews and
negotiations
Amazon Athena
Interactive query service to analyze data in Amazon S3 using standard SQL
No setup costs
Zero setup costs, point to Amazon S3 and start querying
Pay per query
Pay only for queries run, save 30%–90% on per-query costs through compression
Open
ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Streamlined
Serverless, zero infrastructure, zero administration, integrated with Amazon QuickSight
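The effect of compression on per-query cost can be worked through with simple arithmetic, since the charge depends on the amount of data scanned. The sketch below is illustrative only: the price per terabyte and the 4:1 compression ratio are assumptions, so check current Amazon Athena pricing before quoting numbers to a customer.

```python
PRICE_PER_TB_USD = 5.00  # assumed price per TB scanned; verify against current pricing
BYTES_PER_TB = 1024 ** 4

def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost of one query as a function of the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb

raw_bytes = BYTES_PER_TB           # the query scans 1 TB of uncompressed data
compressed_bytes = raw_bytes // 4  # assumed 4:1 compression ratio

cost_raw = query_cost(raw_bytes)
cost_compressed = query_cost(compressed_bytes)
saving = 1 - cost_compressed / cost_raw

print(f"uncompressed: ${cost_raw:.2f}, compressed: ${cost_compressed:.2f}, saving: {saving:.0%}")
```

Under these assumptions the saving is 75 percent, which sits inside the 30%–90% range quoted on the slide; the actual figure depends on the data and the compression format.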
AWS Lake Formation
Challenges of building a secure data
lake
Typical steps to build a secure data lake
1 Set up storage
2 Move data
3 Cleanse, prepare, and catalog data
4 Configure and enforce security and compliance policies
5 Make data available for analytics
Data engineer Data security officer Data analyst
Ingestion and cleaning Security
Analytics and machine learning
AWS Lake Formation for a secure data
lake
1 Ingest and organize
Automates creating data lake and data ingestion.
2 Secure and control
Sets up fine-grained access control and data governance.
3 Collaborate and use
Search and data discovery using Data Catalog metadata. To protect data, all access is checked against set policies.
4 Monitor and audit
Based on data access and governance policies, alert notifications are raised on policy violation and logged.
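Checking every access against set policies, and logging violations, can be sketched as a small function. This is a conceptual illustration in plain Python, not the AWS Lake Formation API: the grant table, the principal names, the column-level granularity, and the log format are assumptions.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("data_lake_audit")

# Hypothetical grants: (principal, table) -> set of columns the principal may read.
grants = {
    ("analyst", "sales"): {"region", "amount"},
    ("engineer", "sales"): {"region", "amount", "customer_email"},
}

def check_access(principal, table, columns):
    """Allow the request only if every requested column is granted; log the decision."""
    allowed = grants.get((principal, table), set())
    denied = set(columns) - allowed
    if denied:
        # A violation is logged so that it can be monitored and audited.
        log.warning("DENY %s on %s: columns %s", principal, table, sorted(denied))
        return False
    log.info("ALLOW %s on %s: columns %s", principal, table, sorted(columns))
    return True

analyst_ok = check_access("analyst", "sales", ["region", "amount"])
analyst_pii = check_access("analyst", "sales", ["customer_email"])
guest_any = check_access("guest", "sales", ["region"])
```

The default for an unknown principal is an empty grant set, so access is denied unless a policy explicitly allows it, and each decision leaves a log entry for the audit step.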
AWS Lake Formation benefits
Amazon
Redshift
Amazon
Athena
AWS Glue
Amazon EMR
Amazon
QuickSight
Amazon
SageMaker
AWS Lake
Formation
Blueprints ML
Transforms
Data
Catalog
Access
control
Amazon S3
data lake storage
Cost effective, durable
storage includes global
replication capabilities.
Simplified ingest and cleaning
enables data engineers to
build faster.
Centralized management of
fine-grained permissions
empowers security officers.
Comprehensive set of
integrated tools enables every
user equally.
Data visualization with
Amazon QuickSight
Amazon QuickSight
BI service built for the cloud with pay-per-session pricing and ML insights
Scalable
Automatically scales with use and activity, with no additional infrastructure requirements. Seamlessly grows with customers.
Pay for use
Pay monthly or annually. With pay-per-session pricing, customers only pay when they access their reports and dashboards, with no upfront costs.
Serverless and fully managed
Fully managed cloud application, meaning there's no upfront cost, software to deploy, capacity planning, maintenance, upgrades, or migrations.
Fully integrated
Deeply integrated with data sources and other AWS services like Amazon Redshift, Amazon S3, Athena, Amazon Aurora, Amazon RDS, IAM, AWS CloudTrail, and Amazon Cloud Directory, providing customers with everything they need for an end-to-end cloud BI solution.
Serverless data lakes and analytics
Amazon S3
AWS Glue
crawler
AWS Glue Data
Catalog
Amazon
Athena
Amazon EMR
Amazon
Redshift
Spectrum
Amazon
QuickSight
Amazon RDS
Web app data
Other databases
On-premises data
Streaming data
Use case: COVID-19 pandemic
Challenge
The COVID-19 pandemic has stressed healthcare systems, businesses, and economies. It has disrupted the daily lives of people around the world. People need a solution to capture data (diagnosis, mortality, and recovery rates) globally in real time, and turn the data into insights they can share and respond to with confidence.
Solution
Amazon worked with APN Partners Salesforce, Tableau, and MuleSoft to create a secure data lake using AWS Data Exchange, AWS Glue, Amazon Athena, and Amazon S3 as a store of trusted data from open source COVID-19 data providers.
Benefits
Health workers, scientists, and decision makers can access and compare international data to their local data, enabling understanding and visualization of the impact of COVID-19 locally and globally. This solution enables decision making and deeper insights to help manage and flatten the COVID-19 curve until a vaccine is available.
Use case: COVID-19 data lake architecture
https://d2908q01vomqb2.cloudfront.net/77de68daecd823babbb58
edb1c8e14d7106e83bb/2020/05/29/COVID-19-AWS-Tableau-
Tableau: COVID-19 data platform Visualization for
desktop for users
Upload to Amazon S3
Amazon S3
Amazon S3 Amazon
Athena
AWS Glue
Lambda function Data revision
export to Amazon S3
Define
Athena Schema
AWS Cloud
AWS Data Exchange
Publish and update data products with
AWS Data Exchange
Connect to S3 data with
Amazon Athena
connector in Tableau
Summary
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Amazon
Redshift
• Amazon S3
• AWS Glue
• AWS Data Exchange
• Amazon Athena
• AWS Lake
Formation
• Amazon QuickSight
AWS data
migration
options
Activity: Serverless Data Lake
Lab Demonstration
Activity overview
The activity consists of a video demonstration of three key steps:
• Step 1: Build a serverless data lake
  • Build a data lake with an AWS CloudFormation template
  • Load raw New York City (NYC) taxi data into an Amazon S3 bucket
  • Program an AWS Glue ETL job to convert raw taxi data into the Parquet data storage format
• Step 2: Run an Amazon Athena query
  • Run a SQL query with Amazon Athena to query taxi data in Parquet format
• Step 3: Visualize data with Amazon QuickSight
  • Use Amazon Athena as the data source for Amazon QuickSight visualizations
https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
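Step 2 of the activity can be sketched in Python. This is a minimal sketch, not the code used in the demonstration: it only builds the request parameters for Athena's StartQueryExecution API. The database name `nyctaxi_db`, the table name `yellow_parquet`, and the results location are hypothetical placeholders. The `run_query` helper assumes the boto3 SDK and AWS credentials are available, so it is defined but not called.

```python
def build_athena_request(database, table, output_location):
    """Build keyword arguments for Athena's StartQueryExecution call."""
    # A simple row count; the real demo query is not given in the slides.
    query = f'SELECT COUNT(*) AS trip_count FROM "{database}"."{table}"'
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request):
    """Submit the query. Needs boto3 and AWS credentials, so it is not called here."""
    import boto3  # imported lazily so the sketch loads without the AWS SDK

    client = boto3.client("athena")
    return client.start_query_execution(**request)["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_athena_request(
    "nyctaxi_db", "yellow_parquet", "s3://example-athena-results/"
)
```

Athena writes query results to the Amazon S3 location given in `ResultConfiguration`, which is why the request needs an output location in addition to the SQL text.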
Step 1: Serverless data lake architecture
AWS Glue Crawler
AWS
Lambda
Amazon S3 Amazon
CloudWatch
Amazon SQS
Amazon SNS
AWS
Lambda
Amazon S3
Amazon
CloudWatch
AWS Glue
Raw zone Processed zone
Email notification
ETL job
Module 4: AWS Data Analytics
Solutions – Part II
Objectives
In this module, you will learn about three key types of data
analytics technical solutions on AWS:
• Streaming and real-time analytics with Amazon Kinesis
• Data governance
• Extended solution: Insights and monetization with machine learning (ML)
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Solution 3: Streaming and
real-time analytics with
Amazon Kinesis
Journey to a modern data
architecture
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Types of data used
Streaming data defined
Data that is generated continuously from thousands of data sources, sent simultaneously
• Player-game interactions
• Geolocation of cars and devices
• Music downloads
• Website clicks
• Social media streams
Common use cases: Real-time
analytics
Milliseconds Seconds Minutes Hours
• Messaging between
microservices
• Response analytics
(web and mobile
application
notifications)
• Log ingestion
• Internet of Things (IoT)
device maintenance
• Change data capture
(CDC)
• Streaming ETL
into data lakes
and data
warehouse
The value of data diminishes over time
Enabling real-time analytics
Data streaming technology enables a customer to ingest, process, and
analyze high volumes of high-velocity data from a variety of sources, in real
time.
Data streaming solution challenges
Challenges of building on-premises, real-time streaming solutions:
• Difficult to set up
• Difficult to achieve high availability
• Error prone and complex to manage
• Tricky to scale
• Integration requires development
• Expensive to maintain
AWS streaming data solutions
Efficiently collect, process, and analyze data streams in real time
Amazon Kinesis Data Streams
Amazon Kinesis Data Firehose
Amazon Kinesis Data Analytics
Data generators: Simple streaming
data patterns
Data producers Streaming services Data consumers
Amazon Kinesis
Data Firehose
Amazon Kinesis
Data Analytics
Amazon Kinesis
Data Streams
Mobile and
applications
Amazon Kinesis Agent
Amazon Kinesis Data
Streams
Amazon CloudWatch Logs
Amazon CloudWatch
Events
AWS IoT
Apache Kafka
Amazon Kinesis Producer
Library (KPL)
Amazon EMR
Amazon Redshift
Amazon Simple
Storage Service (S3)
Amazon EC2
Amazon Kinesis
Connector
Library
Amazon Kinesis Data Streams
Amazon Kinesis Data Streams
Massively scalable, highly durable data ingestion and processing service optimized for real-time data streaming
• No upfront cost; low, pay-as-you-go pricing
• Data collected is available within 70 milliseconds
• Real-time analytics: dashboards, anomaly detection, dynamic pricing
• Data is synchronously replicated across 3 Availability Zones in a Region
• Data can be stored for up to 7 days
• Serverless; can scale dynamically to handle MB to TB each hour and thousands to millions of PutRecords each second
https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=5
How Kinesis Data Streams works
Amazon Kinesis
Data Analytics
Amazon EC2
AWS Lambda
Input
Output
Spark on Amazon EMR
Amazon
Kinesis Data
Streams
Capture and send data Ingest and store data
streams for processing
Build custom, real-time
applications
Analyze streaming data
using BI tools
Kinesis Data Streams architecture
Amazon EC2
instances
Client
Mobile client
Traditional
server
Data
producers
Shard
1
Shard
2
Shard
N
Amazon
Kinesis Data
Stream
EC2
instance
EC2
instance
Data
consumers
Amazon Redshift
Amazon S3
Amazon
Kinesis Data
Firehose
Amazon EMR
Amazon DynamoDB
Shard 1
Data
record
• Sequence #
• Partition Key
• Data blob
Data stream
https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=5
Amazon
Kinesis Data
Firehose
Amazon
Kinesis Data
Analytics
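The data record fields in the architecture above include a partition key, which decides the shard a record lands in. Kinesis Data Streams hashes the partition key with MD5 to a 128-bit integer and routes the record to the shard whose hash key range contains that value. The sketch below illustrates the idea in plain Python. The evenly split ranges and the sample keys are assumptions for illustration; a real stream's ranges come from its shard descriptions.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_ranges(shard_count):
    """Split the 128-bit hash space into contiguous, evenly sized ranges (assumed layout)."""
    step = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * step
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hashed <= end:
            return index
    raise ValueError("hash outside the configured ranges")


ranges = shard_ranges(4)
# Hypothetical partition keys; the same key always maps to the same shard.
assignments = {key: shard_for_key(key, ranges) for key in ("player-1", "player-2", "car-17")}
```

Because the mapping is deterministic, records that share a partition key stay ordered within one shard, while a skewed choice of keys can overload a single shard.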
Kinesis Data Streams provisioning
Amazon Kinesis Data Firehose
How Kinesis Data Firehose works
Amazon
Kinesis Data
Firehose
Input
Output
Splunk
Amazon Redshift
Amazon S3
Amazon
Elasticsearch Service
Capture and send data Prepares and loads data
continuously to the
selected destinations
Durably store the data
for analytics
Analyze streaming data
using analytics tools
Kinesis Data Streams and
Kinesis Data Firehose
Characteristics: Amazon Kinesis Data Streams compared with Amazon Kinesis Data Firehose
Processing time
• Kinesis Data Streams: As fast as 70 milliseconds after ingestion
• Kinesis Data Firehose: Between 60–900 seconds
Stream storage and duration
• Kinesis Data Streams: In shards, default 24 hours and up to 7 days
• Kinesis Data Firehose: Max buffer size 128 MB and max time 900 seconds
Data transformation and conversion
• Kinesis Data Streams: None
• Kinesis Data Firehose: Uses AWS Lambda and AWS Glue
Data producer
• Both: Amazon Kinesis Agent, applications using Amazon Kinesis Producer Library (KPL), AWS SDK for Java, Amazon CloudWatch Logs and CloudWatch Events, AWS IoT
Data consumer
• Kinesis Data Streams: AWS Lambda, Amazon Kinesis Data Analytics, Amazon Kinesis Data Firehose, applications using the Kinesis Client Library (KCL) and SDK for Java
• Kinesis Data Firehose: AWS Lambda, Amazon Kinesis Data Analytics, and Kinesis Data Firehose, apps using the KCL and SDK for Java, Amazon S3, Amazon Redshift, Amazon ES, Splunk, and Amazon Kinesis Data Analytics
Data compression
• Kinesis Data Streams: None
• Kinesis Data Firehose: gzip, Snappy, Zip, or no data compression
https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=5
https://aws.amazon.com/kinesis/data-firehose/faqs/?nc=sn&loc=5
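The Firehose figures in the comparison above (a maximum buffer size and a maximum buffer time) describe size-or-time buffering: a batch is delivered when either limit is reached, whichever comes first. The toy model below is a sketch of that behavior, not the Firehose implementation. The class name, the tiny thresholds, and the explicit `now` timestamps are assumptions chosen to keep the example deterministic.

```python
class DeliveryBuffer:
    """Toy model of size-or-time buffering; not the Firehose implementation."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # time the current batch received its first record
        self.delivered = []    # each entry is one batch written to the destination

    def add(self, record, now):
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        self._flush_if_due(now)

    def tick(self, now):
        """Call periodically so a quiet stream still flushes on time."""
        self._flush_if_due(now)

    def _flush_if_due(self, now):
        if not self.records:
            return
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            self.delivered.append(self.records)
            self.records, self.size, self.opened_at = [], 0, None


buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
buf.add(b"aaaa", now=0)      # 4 bytes, buffered
buf.add(b"bbbbbbb", now=5)   # 11 bytes in total, size threshold reached
buf.add(b"cc", now=10)       # starts a new batch
buf.tick(now=75)             # 65 seconds after the batch opened, time threshold reached
```

Larger thresholds produce fewer, bigger objects at the destination at the cost of latency, which is why Firehose suits near-real-time rather than millisecond use cases.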
When to use Kinesis Data Streams
and Kinesis Data Firehose
Amazon Kinesis Data Streams
For data streaming applications with massive ingestion requirements
• Requires data to be sent to consumer analytics services for millisecond response time
• Massively scalable
• Data retention time ranging from hours to days
• Example: Real-time gaming
Amazon Kinesis Data Firehose
For data streaming applications that require near real-time responses in seconds
• Need for data augmentation, data transformation, or data compression
• Need to save data to Amazon S3, Amazon Redshift, Amazon ES, Splunk, or send data to Amazon Kinesis Data Analytics for analytics
• Example: Log analytics
Amazon Kinesis Data Analytics
Amazon Kinesis Data Analytics
Input: Capture streaming data with Amazon MSK, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, or other data sources
Amazon Kinesis Data Analytics: Query and analyze streaming data
Output: Send processed data to analytics tools to create alerts and respond in real time
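A common way to query and analyze streaming data is to aggregate records over fixed time windows. The sketch below is a plain Python stand-in for that kind of windowed aggregation, not Kinesis Data Analytics code. The event tuples, the page names, and the 60-second window are assumptions for illustration.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Count events per (window start, key) for fixed, non-overlapping windows."""
    counts = Counter()
    for timestamp, key in events:
        # Integer division snaps each timestamp to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical clickstream events as (seconds since start, page) pairs.
events = [(0, "home"), (12, "home"), (59, "cart"), (61, "home"), (119, "cart")]
per_minute = tumbling_window_counts(events, window_seconds=60)
```

Each event belongs to exactly one window, so the counts can be emitted as soon as a window closes and used to drive alerts or dashboards.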
Kinesis data analytics application
details
Use case: Clickstream analytics
Amazon
Kinesis Data
Firehose
Input Output
Amazon
Kinesis Data
Firehose
Amazon
Kinesis Data
Analytics
Amazon Redshift
Evolve from batch processing to real-time analytics
• Websites send clickstream data
• Collects the data and sends to Kinesis Data Analytics
• Processes data in near-real time
• Loads processed data into Amazon Redshift
• Runs analytics models to identify content recommendations
• Readers see personalized content suggestions and increase engagement
Put it all together:
Streaming data analytics with
AWS
Streaming data analytics
architecture
Amazon
Redshift
Amazon
RDS
DynamoDB
Kinesis
Data Streams
Kinesis
Data Firehose
Kinesis
Data Analytics
Amazon
Elasticsearch
Service
Amazon S3
data lake
AWS Lambda
Amazon Simple
Notification Service
Amazon
Kinesis
enabled
applications
Millions of
data sources
Machine
learning
Kinesis
Data Streams
Kinesis
Data Firehose
Data science
Reporting
Logs and
processed data
Downstream
applications
Alerts Notifications
1
2
3
4
5
Fan-out
Solution 4: Data governance
Journey to a modern data
architecture
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Types of data used
Challenges of data in data lakes
• Securing data
• Auditing data usage
• Managing data access
• Safeguarding sensitive data and PII
• Maintaining regulations and
mandates
Data security and governance
© ENTERPRISE STRATEGY GROUP, 2019.
With big data comes big
responsibility.
More than one in three companies cite data privacy and
governance as a hurdle to both digital transformation and IoT
initiatives
34% of IT decision makers cite ensuring data governance/privacy as one of their organization’s biggest digital transformation challenges
37% of IT decision makers cite ensuring security/compliance upon movement of data as one of their most important IoT priorities over the next 18–24 months
https://www.esg-global.com/hubfs/ESG-Infographic-IT-Spending-Intentions-
Resolving PII dangers
Personally identifiable information (PII)
• Consumer consent violation
• Data breach
• Spyware
• Unsecured devices
• Rogue agents
• Second-party misuse
• Espionage
• External hacking
• Do these issues need to be
resolved?
• Is there a solution
architecture that solves all
PII issues?
• What best practices can be
used to mitigate PII
dangers?
Amazon Macie
Amazon Macie
Enable Amazon Macie
• Enable with one click in the AWS Management Console or with a single API call
Continually evaluate Amazon S3 environment
• Automatically generates an inventory of Amazon S3 buckets and details on the bucket-level security and access controls
Discover sensitive data
• Analyzes buckets using ML and pattern matching to discover sensitive data, like PII
• Sensitive data types: financial, personal, national, medical, credentials and secrets
Take action
• Generates findings and sends them to Amazon CloudWatch Events for integration into workflows and remediation actions
De-identified data lake (DIDL) on AWS
A de-identified data lake (DIDL) is an architectural approach that reduces the
risks associated with managing data, particularly personally identifiable
information (PII).
Benefits
Reduce risk
• Remove PII before it enters a data lake
Understand all the data
• Create a Data Catalog of an entire data lake
Reduce compliance costs
• Automate the discovery, classification, de-identification,
and ongoing monitoring of data across an organization
Turn data into an asset, not a liability
• Enable a broader set of governed analytic and machine learning use cases
Masking PII data
Before masking
Email | Customer ID | Transcript
csalazar@example.com | 19664 | Just talked to Carlos Salazar
mary@example.com | 23423 | Mary’s SSN is 000000000
mateo@example.com | 99644 | Mateo is moving to Nevada
NA | 02945 | It is expected to rain tomorrow
After masking
Email | Customer ID | Transcript
4t34gttt | 7462391 | Just talked to Jane Roe
44e5325 | 1239474 | Jorge’s SSN is 666666666
0we&yrw | 9983487 | Sofia is moving to Texas
NA | 3344325 | It is expected to rain tomorrow
Masked fields: Email, ID, Name, SSN, State
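One way to approximate the masking shown above is to replace identifiers with keyed tokens and to redact number patterns in free text. The sketch below is an illustration under stated assumptions, not the method used in the course. It tokenizes the email and customer ID with HMAC-SHA256 and redacts nine-digit SSN-like numbers with a regular expression. It does not replace names or states, which would need entity recognition that this sketch does not attempt. The field names and the secret are hypothetical.

```python
import hashlib
import hmac
import re

# Matches nine digits, with or without the usual dashes.
SSN_PATTERN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")


def tokenize(value, secret):
    """Replace an identifier with a keyed, non-reversible token."""
    if value == "NA":
        return value  # keep the missing-value marker readable
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def mask_record(record, secret):
    """Return a copy of the record with identifiers tokenized and SSNs redacted."""
    return {
        "email": tokenize(record["email"], secret),
        "customer_id": tokenize(record["customer_id"], secret),
        "transcript": SSN_PATTERN.sub("[SSN]", record["transcript"]),
    }


secret = b"demo-only-secret"  # hypothetical key; keep real keys in a secrets manager
masked = mask_record(
    {
        "email": "mary@example.com",
        "customer_id": "23423",
        "transcript": "Mary's SSN is 000000000",
    },
    secret,
)
```

Keyed tokens are stable, so the same customer still joins across tables after masking, while anyone without the key cannot recompute the token from a known email address.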
Extended solution 5: Insights
and monetization with ML on
AWS
Journey to a modern data
architecture
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Types of data used
Data lakes and machine learning
Machine learning requires:
• More data: Collect all types of data
• Flexibility: Define schema during analysis
• Scalability: Scale storage and compute (CPU
or GPU) independently
• Data transformation and processing: Run a
broad set of processing and analytics on the
same data without movement
• Security: Networking, identity, encryption, and
compliance
OLTP ERP CRM LOB
Data warehouse
Business
analytics
Data lake
Devices
Web
Sensors
Social
Data Catalog
AI and
machine learning
Data warehouse
queries
Big data
processing
Interactive Real time
Amazon SageMaker
Machine learning at enterprise scale
Build
• Notebooks for common problems
• High-performance algorithms
• Managed Jupyter for enterprise data science
• Sample notebooks for most common use cases
• Single-pass, streaming training algorithms
Train and tune
• One-click training
• Hyperparameter optimization
• Training models at scale without DevOps assistance
• ML on ML to optimize hyperparameters
Deploy and manage
• One-click deployment
• Fully managed elastic hosting
• Deploy to production with a single call
• Fully managed, production-grade inferences
https://aws.amazon.com/machine-learning/?nc2=h_ql_prod_ml
Machine learning resources
AWS Foundations: How Amazon SageMaker Can Help
• Fundamental digital course on how SageMaker mitigates the core challenges of implementing an ML pipeline
• Duration: 30 minutes
• https://www.aws.training/Details/Video?id=49646
The Machine Learning Pipeline on AWS
• Explore how to use the machine learning pipeline to solve a real business problem (intermediate)
• Duration: 4 days
• https://www.aws.training/SessionSearch?pageNumber=1&courseId=38910
Practical Data Science with Amazon SageMaker
• Learn to solve real-world use cases with machine learning (intermediate)
• Duration: 1 day
• https://www.aws.training/SessionSearch?pageNumber=1&courseId=40748
AWS STP: Machine Learning (ML) on AWS for ML Practitioners - Technical
• https://partnercentral.awspartner.com/LmsSsoRedirect?RelayState=%2flearningobject%2fcurriculum%3fid%3d25521
Summary
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
• Kinesis Data
Streams
• Kinesis Data
Firehose
• Kinesis Data
Analytics
Amazon Macie Amazon
SageMaker
Module 5: AWS Technical
Conversations and
Engagement
Technical engagement
conversations using the Data
Flywheel
The Data Flywheel
1. Move and store data in the cloud
2. Move and manage all workloads in the cloud
3. Build data-driven applications
4. Analyze with data lake architectures
5. Innovate with machine learning
Conversations using the Data
Flywheel
1. Move and store data in the cloud
2. Move and manage all workloads in the cloud
3. Build data-driven apps
4. Analyze with data lake architectures
5. Innovate with machine learning
AWS six-phase strategy
for implementing a data
analytics solution
Data analytics projects: A phased strategy
Phase 1: Data analytics in the cloud assessment
Phase 2: Use case identification
Phase 3: Architecture and data migration
Phase 4: POC delivery
Phase 5: Application tuning and optimization
Phase 6: Migration from POC to production
Phase 1: Data analytics in the cloud
assessment
Phase 2: Use case identification
Phase 3: Architecture and data
migration
Architecture and data migration: APN Partner best practices
Avoid
• Engaging AWS Support too late in the process
Do
• Engage AWS: AWS Partner Development Managers, Partner Solutions Architects, AWS Professional Services
Phase 4: Proof of concept delivery
Phase 5: Application tuning
and optimization
Phase 6: Migration from POC
to production
Phase 6: POC to production best practices
Do
• Identify groups and roles in the organization that requested the POC
• Create a thought-out plan
• Set up a continuous integration and continuous delivery (CI/CD) pipeline
• Set up metrics and alarms for production environment
• Continue engagement with the customer
AWS well-architected review
using the Analytics Lens
10 design principles:
Analytics applications, 1–5
1. Automate data ingestion to handle big data
2. Design ingestion for failures and duplicates
3. Preserve original source data
4. Describe data with metadata
5. Establish data lineage
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Analytics-Lens.pdf
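Principle 2 above, designing ingestion for failures and duplicates, usually means making ingestion idempotent so that a retried batch does not create duplicate records. The sketch below shows one possible approach in plain Python, assuming a content-derived record ID and an in-memory set of seen IDs. A real pipeline would need a durable store for those IDs, and the record fields are hypothetical.

```python
import hashlib
import json


def record_id(record):
    """Derive a stable ID from the record's content."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def ingest(batch, seen_ids, sink):
    """Append only records not seen before; safe to call again after a retry."""
    accepted = 0
    for record in batch:
        rid = record_id(record)
        if rid in seen_ids:
            continue  # duplicate from a retry or redelivery
        seen_ids.add(rid)
        sink.append(record)
        accepted += 1
    return accepted


seen, sink = set(), []
batch = [{"trip": 1, "fare": 9.5}, {"trip": 2, "fare": 14.0}]
first = ingest(batch, seen, sink)
retry = ingest(batch, seen, sink)  # simulated redelivery of the same batch
```

Sorting the keys before hashing keeps the ID stable even when a producer serializes the same record with a different field order.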
10 design principles:
Analytics applications, 6–10
6. Use the right ETL tool for the job
7. Orchestrate ETL workflows
8. Tier storage appropriately
9. Secure, protect, and manage the entire analytics pipeline
10. Design for scalable and reliable analytics pipelines
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Analytics-Lens.pdf
Module 6: APN Partner
Opportunities and Resources
Objectives
In this module, you will learn how to:
• Describe how to collaborate with AWS for data analytics
• Describe AWS Data and Analytics resources for APN Partners:
• Competency categories
• AWS Immersion Days
• AWS Certified Data Analytics and learning resources
• Access the AWS Marketplace
• Perform the calls to action
APN Partners and
AWS for Data Analytics
Discounting and funding programs
Migration
programs
POC funding
AWS Data and Analytics
Competency categories
• Data Analytics Platforms: Provide a set of integrated tools to solve data analytics challenges within a standard framework
• NoSQL/New SQL: Provide highly scalable databases that organize data into a structure
• Data Integration and Preparation: Enable customers to move and consolidate data from disparate sources, transform it, and prepare it for analytics
• Business Intelligence (BI) and Data Visualization: Help customers turn raw data into actionable business information, such as reporting, dashboards, and data visualization
• Data Governance and Security: Help customers discover, categorize, and control their data
Best practices after identifying an
opportunity
Use existing Partner
programs
Cultivate strong
relationships with
AWS sales teams
Register your
opportunity
through
APN Partner Central
Achieve AWS Data and
Analytics competency
Collaboration workflow
Build a
reference
solution
Conduct a big
data POC
Validate the
POC
Build and
deliver the live
solution
Receive
approval from
AWS PSM
Engage
AWS sales
Engage AWS
account or
Partner SA
Register an
opportunity on
APN Partner
Central
Before SA
involvement
Direct SA
involvement
AWS Professional Services
• Global team of experts
• Collaborate with APN Partners to help customers realize their
desired business outcomes in AWS Cloud
• Reach out to APN Partners when they need additional resources
AWS Professional Services: https://aws.amazon.com/professional-services/
AWS data analytics solutions
and Immersion Days
AWS Data Lab program
• The AWS Data Lab program offers accelerated joint engineering
engagements between a team of customer builders and AWS
technical resources to create tangible deliverables that
accelerate data and analytics modernization initiatives.
• Two offerings:
  • Design Lab: Focus on real-world architectural design
  • Build Lab: Focus on providing guidance with building a functioning prototype with a customer team
Duration: Half day to 5 days
Location: Virtual or AWS Data Lab hub (Seattle, NYC, Herndon (VA), London, Bangalore)
Cost: Free. Reach out to your APN support team for more information.
https://aws.amazon.com/aws-data-lab/
AWS Immersion Days
Designed to help APN Advanced and Premier Consulting Partners deliver technical data
analytics workshops to their customers and help grow their businesses
• Data Engineering Immersion Day: Build a serverless data lake solution on AWS including modules focusing on ingestion, hydration, exploration, and consumption
• Amazon EMR Immersion Day: Focus on unique facets of Amazon EMR for big data workloads
• Database Migration Immersion Day: Give your customers a head start with the AWS Database Migration Service and the Schema Conversion Tool
• … and many more.
Benefits: Access to technical workshop content, AWS usage credits, Market Development Funds (MDF) opportunities, and support from AWS teams
https://aws.amazon.com/partners/immersion-days/
AWS Certified data analytics
and learning resources
AWS Technical Professional Learning
Path
AWS Certified Data Analytics –
Specialty
https://aws.amazon.com/certification/certified-data-analytics-specialty/
Partner Cast: Analytics
Call to action
Build a data analytic practice on AWS
Build packaged
solutions
Know your
Partner Solutions
Architect
Ask for customer
references
Engage with AWS
service teams
Develop
customer
workshops
Achieve an APN
competency
Call to action
• Use the Data Flywheel to perform assessments
• Work with your Partner team to schedule an Immersion Day for your customers
• View the analytics customer case studies: https://aws.amazon.com/big-data/datalakes-and-analytics/
• Create a specialized service around one of the analytics services
• Participate in the AWS Data Lab: https://aws.amazon.com/aws-data-lab/
• Prepare for the AWS Data Analytics – Specialty certification
• Build relationships with APN teams for funding opportunities for your marketing and sales efforts
© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior
written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. Corrections or feedback on the course, please email
us at: aws-course-feedback@amazon.com. For all other questions, contact us at: https://aws.amazon.com/contact-us/aws-training/. All trademarks are the
property of their owners.
Thank You!
Parvesh Chopra : choprapa@amazon.com

AWS Partner Data Analytics on AWS_Handout.pdf

  • 1. AWS Partners: Data Analytics on AWS – Technical Parvesh Chopra choprapa@amazon.com
  • 3. Course objectives In this course, you will learn how to: • Identify Amazon Web Services (AWS) services in the AWS analytics stack • Describe decision points and technology selections for data analytics architectures • Design highly available and fault-tolerant serverless data analytics architectures • Discuss the AWS Data Pipeline and the customer data analytics journey using the Data Flywheel • Describe five AWS data analytics technical solutions: modernizing a data warehouse with Amazon Redshift, data lakes, streaming data, data governance, and machine learning (ML) • Identify technical engagement strategies and best practices for delivering a proof of concept (POC) • Locate and use AWS Partner Network (APN) Partner resources for opportunities and training
  • 4. About this course • This course is for technical professionals at APN Consulting Partner organizations who are engaged in pre-sales discussions with customers to help architect data analytics solutions on AWS and answer technical questions about using AWS data analytics services. • This 1-day course is focused on educating technical professionals with sufficient technical knowledge on AWS data analytics services and solutions to successfully engage with and help customers. • This course is not designed to be a technical deep dive into AWS data analytics services and solutions. It provides the necessary resources and learning path towards gaining deeper knowledge into the services.
  • 5. Agenda • Module 1: Course Introduction • Module 2: AWS Data Analytics Stack Portfolio • Break • Module 3: AWS Data Analytics Solutions – Part I - Data lake solution • Break • Module 4: AWS Data Analytics Solutions – Part II • Break • Module 5: Technical Engagement Strategies • Module 6: APN Partner Opportunities and Resources
  • 6. Module 2: AWS Data Analytics Portfolio
  • 7. Objectives In this module, you will learn how to: • Understand customer challenges related to data analytics in their business • Provide a technical overview of the AWS data analytics portfolio • Discuss technical advantages and position of data analytics solutions on AWS • Explain how to build a data analytics pipeline • Explain the Data Flywheel
  • 8. Customer challenges and opportunities for APN Partners
  • 9. New realities • By making 10% more data accessible, a typical Fortune 1000 company will see a $65 million increase in net income.* • Explosion of data: connected devices, apps, and systems generate more data than ever before. • Pay-as-you-go pricing allows organizations to analyze data to gain insights. • Demand is growing for faster decision making on real-time data. *Source: Forbes Online; New Vantage Partners - Big Data Executive Survey https://www.forbes.com/sites/cognitiveworld/2019/02/06/data-the-fuel-powering-ai-digital-transformation/#5062b36b578b
  • 10. Customers need your help 85% of businesses want to be data driven, but only 37% have been successful. https://www.forbes.com/sites/cognitiveworld/2019/02/06/data-the-fuel-powering-ai-digital-transformation/#51efb027578b http://newvantage.com/wp-content/uploads/2017/01/Big-Data-Executive-Survey-2017-Executive-Summary.pdf
  • 11. Common data analytics challenges Top four challenges involve knowledge, skill, security, and privacy. This is your opportunity. Survey question: What challenges do you see when using big data analytics/technologies? (n=545) Challenges shown: • Inadequate technical know-how in our company • Inadequate analytical know-how in our company • Data security (unauthorized access to company data) • Data privacy issues (safety of personal data) Chart values: 53%, 49%, 48%, 48% https://bi-survey.com/challenges-big-data-analytics
  • 12. AWS data analytics portfolio overview
  • 13. Secure infrastructure for analytics Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake. • Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail • Security: Amazon GuardDuty, AWS Shield, AWS Well-Architected Tool, Amazon Macie, Amazon Virtual Private Cloud (Amazon VPC) • Encryption: AWS Certificate Manager Private Certificate Authority (ACM Private CA), AWS Key Management Service (AWS KMS), encryption at rest, encryption in transit, bring your own keys, hardware security module (HSM) support • Identity: AWS Identity and Access Management (IAM), AWS Single Sign-On, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
  • 14. AWS data analytics portfolio • Data movement: AWS Database Migration Service (AWS DMS) | AWS Snowball | AWS Snowmobile | Amazon Kinesis Data Firehose | Amazon Kinesis Data Streams | Amazon Managed Streaming for Apache Kafka • Data lake infrastructure and management: Amazon Simple Storage Service (Amazon S3) & Amazon S3 Glacier | AWS Glue | AWS Lake Formation • Analytics: Amazon Redshift | Amazon EMR (Spark and Presto) | Amazon Athena | Amazon Elasticsearch Service | Amazon Kinesis Data Analytics | AWS Glue (Spark and Python) • Data visualization, engagement, and machine learning: Amazon QuickSight | Amazon SageMaker | Amazon Comprehend | Amazon Lex | Amazon Polly | Amazon Rekognition | Amazon Translate | Amazon Pinpoint | AWS Data Exchange
  • 15. Data movement services Help customers move data from on premises to the cloud • AWS DMS • AWS Snowball • AWS Snowmobile • Amazon Managed Streaming for Apache Kafka • Amazon Kinesis Data Streams • Amazon Kinesis Data Firehose
  • 16. Data lake services Customers are constrained by volume, variety, veracity, and velocity of on-premises data, and data silos pose a major challenge. • Amazon S3 • Amazon S3 Glacier • AWS Lake Formation • AWS Glue
  • 17. Analytics services Help customers extract value out of their data • Amazon Redshift • Amazon EMR • AWS Glue • Amazon ES • Amazon Athena • Amazon Kinesis Data Analytics
  • 18. Data visualization, engagement, and machine learning services Help customers understand and visualize their data, and use machine learning (ML) for advanced analytics and predictions • Amazon QuickSight • Amazon SageMaker • AWS Data Exchange
  • 19. AWS value proposition
  • 20. Standards, formats, and open source • Apache Flink • Ganglia • Apache HBase • HCatalog • Hadoop Distributed File System (HDFS) • Apache Hive • Hudi • Java • JupyterHub • Apache Kafka • Apache Livy • Apache Mahout • MapReduce • Apache MXNet • MySQL • Apache Oozie • Apache ORC • Apache Parquet • Phoenix • Apache Pig • Presto • Python • PyTorch • R • Scala • Apache Spark • Sqoop • SQL • TensorFlow • Tez • Yarn • Apache Zeppelin • Apache Zookeeper …and many more
  • 21. AWS alternatives to open source • Amazon EMR: Hadoop and Spark (Spark, Hive, Presto, Flink, HBase) • Amazon ES: operational analytics (Elasticsearch, Logstash, Kibana) • Managed Streaming for Apache Kafka: real-time analytics (Kafka)
  • 22. Data analytics pipeline
  • 23. Data management challenges How can customers: • Collect a variety of data types accumulating at varying velocities? • Collect data from numerous sources accumulating at differing velocities? • Store massive amounts of data without running out of space? • Cleanse data and improve its quality before analysis? Can they automate these steps?
  • 24. Data analytics pipeline Data → Collect → Store → Process and analyze → Visualize → Insights • Time-to-answer (latency) • Balance of throughput and cost https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf?did=wp_card&trk=wp_card
  • 25. Data pipeline challenges Building a data pipeline is challenging. Customers must: • Manage updates, patches, and software integrations • Handle increased overhead costs plus need for support • Maintain focus on the core task of building applications that lead to data insights
  • 26. AWS data analytics pipeline services Stages: Collect, Store, Process and analyze, Visualize, Automate. Services shown: Amazon Kinesis Data Firehose, AWS Direct Connect, Amazon Kinesis Data Streams, AWS Snowball, AWS Database Migration Service, Amazon Managed Streaming for Apache Kafka, Amazon S3 Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Aurora, Amazon CloudSearch, Amazon ES, Amazon EMR, Amazon Kinesis Data Analytics, Amazon Redshift, Amazon Athena, Amazon SageMaker, AWS Glue, Amazon QuickSight
  • 27. Data Flywheel
  • 28. Data Flywheel and customer journey Build data-driven applications Modernize data warehouse and build a data lake Migrate data and workloads to the cloud ✓ Save time ✓ Save costs Store and manage data ✓ Agility ✓ Global distribution ✓ Scale and performance ✓ New and faster insights ✓ Broader access to analytics Innovate with machine learning ✓ Better experiences ✓ Deeper engagement ✓ Efficient processes Attract new customers Generate more data https://pages.awscloud.com/data-flywheel.html
  • 29. Summary In this module, you learned about: • Customer challenges related to data analytics • AWS data analytics portfolio • Technical benefits of AWS data analytics solutions • Data analytics pipeline • Data Flywheel
  • 30. Module 3: Data Analytics Solutions on AWS – Part I
  • 31. Objectives In this module, you will learn how to: • Explain data migration options from on premises to the AWS Cloud • Describe two AWS data analytics technical solutions: modernizing a data warehouse with Amazon Redshift, and data lakes. Evolution of data architecture: traditional data warehousing, data lakes on AWS, real-time analytics with streaming data, data warehouse modernization, data governance, machine learning
  • 32. Data migration options
  • 33. Journey to a modern data architecture Evolution of data architecture: traditional data warehousing, data lakes on AWS, data warehouse modernization, types of data, data governance, machine learning, real-time analytics with streaming data
  • 34. AWS data migration options • AWS Snowball (Snowball Edge storage optimized, AWS Snowmobile) • AWS Storage Gateway (file gateway, tape gateway, volume gateway) • Amazon S3 Transfer Acceleration • AWS Direct Connect • AWS Database Migration Service • Amazon Kinesis Data Firehose
  • 35. Solution 1: Modernizing a data warehouse with Amazon Redshift
  • 36. Journey to a modern data architecture Evolution of data architecture: traditional data warehousing, data lakes on AWS, data warehouse modernization, types of data, data governance, machine learning, real-time analytics with streaming data
  • 37. Data warehouses
  • 38. Traditional architecture and on-premises data warehouse challenges • Difficult to scale • Long lead times for hardware procurement • Complex upgrades are the norm • High overhead costs for administration • Expensive licensing and support costs • Proprietary formats do not support newer open data formats, which results in data silos • Data not cataloged, unreliable quality • Licensing cost limits number of users and how much data can be accommodated • Difficult to integrate with services and tools
  • 39. Amazon Redshift
  • 40. Amazon Redshift A secure, fully managed data warehouse that is highly integrated with other AWS services and extends seamlessly to a data lake. Features include: • Optimized for high performance • Support for open file formats • Petabyte-scale capability • Support for complex queries and analytics, with data visualization tools • Secure end-to-end encryption and certified compliance • Service Level Agreement (SLA) of 99.9 percent • Based on the open source PostgreSQL database • Cost efficient https://aws.amazon.com/redshift/pricing/
  • 41. Amazon Redshift performance features • Massively parallel processing (MPP): breaks a large job into smaller tasks, then distributes the tasks to multiple compute nodes. Result: faster processing time • Columnar storage: data from each column is stored together so the data can be accessed faster, without scanning and sorting all other columns. Result: compression of stored data improves performance • Shared-nothing architecture: independent and resilient nodes without any dependencies. Result: improves scalability
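The two ideas on this slide, columnar storage and massively parallel processing, can be illustrated with a small Python sketch. This is a toy model only, not how Amazon Redshift is implemented: the sample rows, the `to_columns` and `parallel_sum` helpers, and the thread pool standing in for compute nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample rows; not taken from the course material.
ROWS = [
    {"order_id": 1, "region": "EU", "amount": 20.0},
    {"order_id": 2, "region": "US", "amount": 35.5},
    {"order_id": 3, "region": "EU", "amount": 12.5},
    {"order_id": 4, "region": "US", "amount": 40.0},
]

def to_columns(rows):
    """Pivot row-oriented records into one list per column (columnar layout)."""
    return {name: [row[name] for row in rows] for name in rows[0]}

def parallel_sum(values, workers=2):
    """Split one column into chunks, sum each chunk in parallel, then combine the partial sums."""
    size = max(1, -(-len(values) // workers))  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)

columns = to_columns(ROWS)
# Only the "amount" column is read; "order_id" and "region" are never touched.
total = parallel_sum(columns["amount"])
print(total)
```

Because the aggregate reads a single column list, the other columns are never scanned, which is the access pattern the slide attributes to columnar storage.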
  • 42. Amazon Redshift architecture • Client applications connect through Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) • Data warehouse cluster: leader node, compute node 1 (node slices), compute node 2 (node slices) https://docs.aws.amazon.com/redshift/index.html
  • 43. Leader node Responsible for communication with the client application and compute nodes. Amazon Redshift leader node: • SQL endpoint • Metadata • Query compilation and optimization • Coordinates parallel SQL processing • Machine learning (ML) optimizations
  • 44. Compute node Runs queries in parallel and returns the result to the leader node • SQL running powerhouses • Compute nodes can load, unload, back up, and restore data to and from Amazon S3. • Clusters range from 1 to 128 nodes.
  • 45. Compute node slices Slices are a symmetric multiprocessing (SMP) mechanism. • Each compute node is partitioned into slices. • Slices work in parallel to complete operations. • Slices are virtual processors contained in each compute node. • Each slice is allocated an equal amount of memory, compute allowance, and disk space. • Each slice operates in parallel but can request data from other slices. Example shown: Slice 1 and Slice 2, each with a local disk, a virtual core, and 7.5 GB RAM.
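As a rough illustration of slices working in parallel on their own share of the data, the sketch below assigns rows to slices with a stable hash, lets each slice compute a partial result, and merges the partials the way a leader node would. The CRC32 hash, the `customer_id` key, and the helper names are assumptions chosen for the example; the slides do not describe how Amazon Redshift assigns rows to slices.

```python
import zlib
from collections import Counter

def slice_for(key, slice_count):
    """Map a key to a slice index with a stable hash (CRC32, chosen only for illustration)."""
    return zlib.crc32(str(key).encode("utf-8")) % slice_count

def distribute(rows, key, slice_count):
    """Place every row on exactly one slice, based on the value of its key column."""
    slices = [[] for _ in range(slice_count)]
    for row in rows:
        slices[slice_for(row[key], slice_count)].append(row)
    return slices

def count_by(rows, column):
    """Partial aggregate computed independently by one slice."""
    return Counter(row[column] for row in rows)

# Hypothetical data: eight customers, alternating between two regions.
rows = [{"customer_id": i, "region": "EU" if i % 2 else "US"} for i in range(8)]

slices = distribute(rows, "customer_id", slice_count=4)
partials = [count_by(s, "region") for s in slices]

# Leader-style merge of the per-slice partial counts.
merged = sum(partials, Counter())
print(dict(merged))
```

The merged result does not depend on which slice held which row, which is why the work can be split freely across slices.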
  • 46. Amazon Redshift instance types https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
  • 47. Management interfaces https://us-west-2.console.aws.amazon.com/redshiftv2/home?region=us-west-2#query-editor
  • 48. Amazon Redshift differentiating features
  • 49. Amazon Redshift differentiating features • Federated query • Amazon Redshift lake house architecture
  • 50. Federated query Sources shown: data warehouse, Amazon Aurora, OLTP, ERP, CRM, LOB. • Integrate queries on live data in Amazon RDS for PostgreSQL and Amazon Aurora PostgreSQL with queries on Amazon Redshift and the data lake. • Reduce data moved over the network with Amazon Redshift’s intelligent optimizer, which pushes and distributes portions of computation directly into remote operational databases. Benefits • Incorporate live data into business intelligence (BI) and reporting applications • Ingest data into Amazon Redshift • Query operational databases directly • Apply transformations on the fly • Load data into target tables without complex ETL pipelines
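A minimal sketch of what setting up a federated query can look like, written as Python that assembles the SQL text. The `CREATE EXTERNAL SCHEMA ... FROM POSTGRES` form follows the Amazon Redshift SQL reference as best recalled here, so check the current documentation before relying on it. Every identifier below (schema names, host, ARNs, the `orders` and `customers` tables) is a hypothetical placeholder, and the string formatting is for illustration only, not for building SQL from untrusted input.

```python
def federated_schema_ddl(local_schema, database, remote_schema, host, port,
                         iam_role_arn, secret_arn):
    """Build a CREATE EXTERNAL SCHEMA statement that maps a PostgreSQL schema into Amazon Redshift."""
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {local_schema}\n"
        f"FROM POSTGRES\n"
        f"DATABASE '{database}' SCHEMA '{remote_schema}'\n"
        f"URI '{host}' PORT {port}\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"SECRET_ARN '{secret_arn}';"
    )

# All values below are placeholders, not real resources.
ddl = federated_schema_ddl(
    local_schema="ops",
    database="orders_db",
    remote_schema="public",
    host="example-cluster.cluster-abc123.us-west-2.rds.amazonaws.com",
    port=5432,
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleFederatedQueryRole",
    secret_arn="arn:aws:secretsmanager:us-west-2:123456789012:secret:example-secret",
)

# Join live operational rows (ops.orders) with a local warehouse table (sales.customers).
query = (
    "SELECT c.segment, COUNT(*) AS open_orders\n"
    "FROM ops.orders o\n"
    "JOIN sales.customers c ON c.customer_id = o.customer_id\n"
    "WHERE o.status = 'OPEN'\n"
    "GROUP BY c.segment;"
)

print(ddl)
print(query)
```

Once the external schema exists, the join reads the operational table in place, which matches the slide's point about querying operational databases directly without a separate ETL pipeline.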
  • 51. Amazon Redshift lake house architecture. With Amazon Redshift lake house architecture, customers can: • Query data in the data lake and write data back in open formats • Use familiar SQL statements to combine and process data across data stores • Run queries on live data in operational databases without requiring data loading and ETL pipelines. Amazon Redshift lake house queries are run by a fleet of nodes that are owned and maintained by AWS. https://aws.amazon.com/redshift/lake-house-architecture/
  • 52. Diagram: SQL clients and business intelligence tools connect over JDBC/ODBC to the leader node; compute nodes 1 and 2 (node slices) work with the Amazon Redshift lake house fleet, Amazon S3, and the AWS Glue Data Catalog. Steps: 1. Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY… 2. The query is optimized and compiled using ML at the leader node, which determines what is run locally and what goes to Amazon Redshift lake house. 3. The query plan is sent to all compute nodes. 4. Compute nodes obtain partition information from the Data Catalog and dynamically prune partitions. 5. Each compute node issues multiple requests to the Amazon Redshift lake house layer. 6. Amazon Redshift lake house nodes scan Amazon S3 data. 7. Amazon Redshift lake house projects, filters, joins, and aggregates. 8. Final aggregations and joins with local Amazon Redshift tables are done in-cluster. 9. The result is sent to the client.
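Steps 4 through 8 above (prune partitions, scan and partially aggregate near the data, then finish the aggregation in the cluster) can be illustrated with a toy model. This is a sketch under stated assumptions, not how Amazon Redshift is implemented: the in-memory `CATALOG`, its partition names, and every function here are hypothetical stand-ins.

```python
from collections import Counter

# Hypothetical catalog: partition value -> rows stored under that S3 prefix.
CATALOG = {
    "2020-01": [{"region": "us", "sales": 10}, {"region": "eu", "sales": 5}],
    "2020-02": [{"region": "us", "sales": 7}],
    "2020-03": [{"region": "eu", "sales": 3}, {"region": "eu", "sales": 4}],
}


def prune_partitions(catalog, wanted):
    """Step 4: keep only the partitions the query predicate can match."""
    return [partition for partition in catalog if partition in wanted]


def scan_and_aggregate(rows):
    """Steps 6-7: scan one partition and return a partial COUNT(*) per group."""
    return Counter(row["region"] for row in rows)


def run_query(catalog, wanted):
    """Step 8: merge the partial counts into the final result."""
    partials = [scan_and_aggregate(catalog[p]) for p in prune_partitions(catalog, wanted)]
    final = Counter()
    for partial in partials:
        final.update(partial)
    return dict(final)


# COUNT(*) ... GROUP BY region, restricted to two of the three partitions.
result = run_query(CATALOG, {"2020-01", "2020-03"})
```

The pruned partition ("2020-02") is never scanned, which is the point of step 4: less data read from Amazon S3 for the same answer.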
  • 53. Advanced Query Accelerator (AQUA). A new distributed and hardware-accelerated cache that makes Amazon Redshift faster than other cloud data warehouses, without increasing cost. • Minimizes data movement over the network by pushing operations to AQUA nodes • AQUA nodes use custom AWS-designed analytics processors to make operations (compression, encryption, filtering, and aggregations) faster than on traditional CPUs. Diagram: RA3 clusters use AQUA nodes (each with a custom AWS-designed processor) running in parallel over Amazon Redshift managed storage.
  • 54. Migration to Amazon Redshift
  • 55. Migration pattern. Migration from a legacy OLAP system. Workload Qualification Framework (WQF) uses the AWS Schema Conversion Tool (AWS SCT) to generate reports, such as: • Workload assessment based on complexity, size of migration effort, and technologies • Recommendations on migration strategies • Step-by-step instructions for migration • Assessment of migration effort based on team size and member roles
  • 56. AWS SCT data extractors. Extract data from your data warehouse and migrate to Amazon Redshift • Extracts data through local migration agents • Data is optimized for Amazon Redshift and saved in local files • Files are loaded to an Amazon S3 bucket (through the network or AWS Snowball Edge) and then to Amazon Redshift. Diagram labels: source data warehouse (Netezza, Microsoft SQL Server), AWS SCT, Amazon S3 bucket, Amazon Redshift.
  • 57. Equinox sees faster reports, 80% cost savings. Challenge: Their data warehouse had limited integration, was very expensive, and required a lot of platform-specific domain knowledge. They needed to reduce administration and costs, blend structured and semi-structured data for analytics, and evolve into a data lake strategy. Solution: Equinox migrated from a legacy data warehouse to Amazon Redshift to combine data from disparate sources like clickstream data, cycling log data, club management software, and more. They land data directly in an Amazon S3 data lake and perform analytics using Amazon Redshift, Amazon Redshift Spectrum, and Amazon EMR. Result: Their monthly Amazon Redshift bill is now 20% of prior yearly maintenance of their legacy data warehouse. The AWS data lake and analytics reduced report delivery time from months to days.
  • 58. Use case: Equinox (continued). • Migrated from Teradata data warehouse • Built a data warehouse with Amazon Redshift and a data lake with Amazon S3 • Analytics on the data lake with Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR • Increased user productivity to move faster • Amazon Redshift costs approximately 20% of original Teradata maintenance and support • Report time reduced from months to days. Diagram labels: clickstream, cycling logs, club management software, applications, social, Equinox applications, third-party applications, Maximilian (ELT scripts), Spark on Amazon EMR, Amazon S3, Amazon Redshift, Amazon Athena, Amazon EMR.
  • 59. Solution 2: Data lakes
  • 60. Journey to a modern data architecture. Evolution of data architecture: traditional data warehousing, data lakes on AWS, data warehouse modernization, data governance, machine learning, real-time analytics with streaming data. Diagram label: types of data.
  • 61. Data lakes defined. Architectural approach for a centralized enterprise data repository stored on Amazon S3. • Stores all structured, semi-structured, unstructured, and binary data at unlimited scale • Holds curated and raw data • Uses AWS data analytics tools for analytics • Increases pace of innovation by extracting insights from data • Enables more organizational agility • Reduces cost and delivers results with predictive analytics and ML. Diagram labels: machine learning, business intelligence and analytics, data warehousing, data lake, open formats, central catalog.
  • 62. Secure data lake on Amazon S3. Amazon S3 Access Points: • Multi-tenant bucket • Dedicated access points • Customer permissions from an Amazon Virtual Private Cloud (Amazon VPC). Amazon S3 Block Public Access: • Across AWS accounts and Amazon S3 bucket level • Specify public permissions using Access Control List (ACL) or policy • Four settings: BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets. Amazon S3 object tags: • Access control, lifecycle policies, and analysis • Classify data with metadata • Use tags to filter objects • Define replication policies • Populate tags with AWS Lambda functions or S3 Batch Operations. Amazon S3 object lock: • Immutable Amazon S3 objects • Retention management controls • Data protection and compliance. Diagram label: Amazon FSx for Lustre. https://aws.amazon.com/compliance/services-in-scope
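The four Block Public Access settings named above map directly onto a configuration document. The sketch below builds that document and shows, in a function that is defined but not called here, how it might be applied with the boto3 `put_public_access_block` call. The helper names and the bucket name are hypothetical, and running `apply_to_bucket` assumes boto3 is installed and AWS credentials are configured.

```python
def block_public_access_config(enabled: bool = True) -> dict:
    """Build the four-setting Block Public Access configuration."""
    return {
        "BlockPublicAcls": enabled,        # reject new public ACLs
        "IgnorePublicAcls": enabled,       # ignore public ACLs that already exist
        "BlockPublicPolicy": enabled,      # reject new public bucket policies
        "RestrictPublicBuckets": enabled,  # restrict access to buckets with public policies
    }


def apply_to_bucket(bucket_name: str, config: dict) -> None:
    """Apply the configuration to one bucket (needs boto3 and credentials; not run here)."""
    import boto3  # imported lazily so the sketch loads without the SDK installed

    s3 = boto3.client("s3")
    s3.put_public_access_block(
        Bucket=bucket_name,
        PublicAccessBlockConfiguration=config,
    )


config = block_public_access_config()
# Example call with a hypothetical bucket name:
# apply_to_bucket("example-data-lake-bucket", config)
```

Turning on all four settings is a common starting point for a data lake bucket that should never be public; individual settings can be relaxed only where a workload genuinely requires it.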
  • 63. Reference architecture: Data lake on AWS. Data ingestion: Amazon Kinesis, AWS Direct Connect, AWS Snowball, AWS DMS, AWS Data Exchange. Central storage: Amazon S3. Catalog and search: AWS Glue, Amazon ES, Amazon DynamoDB. Processing and analytics: machine learning, Amazon QuickSight, Amazon EMR, Amazon Redshift, Amazon Athena. Access and user interface: Amazon API Gateway, IAM, Amazon Cognito. Protect and secure: IAM, Amazon CloudWatch, AWS STS, AWS CloudTrail, AWS KMS.
  • 64. Data services – AWS Glue
  • 65. Cleansing data. After migration, data still presents challenges: • Data is increasingly diverse and accumulates rapidly (volume, variety, velocity, veracity) • It must be cleansed before it is analyzed by many applications (missing or incorrect data, wrong data format, partially missing data) • Avoid unsearchable data. How can customers provide access to users to gain insights?
  • 66. AWS Glue. AWS Glue Data Catalog: • Hive metastore compatible with enhanced functionality • Crawlers automatically extract metadata and create tables • Integrates with Amazon Athena, Amazon EMR, and many more. Job running: • Run jobs on a serverless Spark platform • Use flexible scheduling, job monitoring, and alerting. Job authoring: • Generates ETL code • Build on open frameworks: Python, Scala, and Apache Spark • Developer-centric: editing, debugging, sharing. Job workflow: • Orchestrate triggers, crawlers, and jobs • Author and monitor entire flows with integrated alerting
  • 67. AWS Glue crawlers. Diagram: an AWS Glue crawler, using an AWS IAM role, reaches databases and Amazon Redshift (JDBC connection), Amazon DynamoDB (NoSQL connection), and Amazon S3 (object connection). Built-in classifiers: MySQL, MariaDB, PostgreSQL, Amazon Aurora, Oracle, Amazon Redshift, Apache Avro, Parquet, ORC, XML, JSON and JSONPaths, AWS CloudTrail, Binary JSON (BSON), logs, delimited … growing
  • 68. AWS Glue Data Catalog services. Diagram: the AWS Glue Data Catalog serves Amazon Redshift lake house, Amazon Athena, AWS Glue ETL, and Amazon EMR.
  • 69. Use case: Log aggregation with ETL. Diagram labels: AWS service logs, web application logs, and server logs; Amazon S3 bucket; AWS Glue crawler (update table partition); AWS Glue ETL (create partition on Amazon S3); Amazon S3 bucket; AWS Glue Data Catalog; Amazon Athena (query data).
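The "create partition on Amazon S3" step in the log aggregation flow is often done by writing objects under Hive-style `key=value` prefixes, a layout that crawlers and query engines can treat as partition columns. A minimal sketch of building such a key follows; the `logs/` prefix, the `source` column, and the file name are hypothetical choices for illustration.

```python
from datetime import datetime, timezone


def partitioned_key(source: str, event_time: datetime, filename: str) -> str:
    """Build an S3 object key whose prefixes encode source/year/month/day partitions."""
    return (
        f"logs/source={source}/"
        f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}/"
        f"{filename}"
    )


key = partitioned_key(
    "web",
    datetime(2020, 5, 29, tzinfo=timezone.utc),
    "part-0000.json",
)
# key == "logs/source=web/year=2020/month=05/day=29/part-0000.json"
```

With this layout, a query that filters on `year`, `month`, or `day` only needs to read the matching prefixes instead of the whole bucket.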
  • 70. Data services – AWS Data Exchange and Amazon Athena
  • 71. AWS Data Exchange. Find and subscribe to third-party data in the cloud. Find diverse data in one place: • More than 1,000 data products • More than 80 data providers. Analyze data: • Download or copy data to Amazon S3 • Combine, analyze, and model with existing data. Access third-party data: • Streamlined access to data • Minimize legal reviews and negotiations
  • 72. Amazon Athena. Interactive query service to analyze data in Amazon S3 using standard SQL. No setup costs: zero setup costs; point to Amazon S3 and start querying. Pay per query: pay only for queries run; save 30%–90% on per-query costs through compression. Open: ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types. Streamlined: serverless, zero infrastructure, zero administration, integrated with Amazon QuickSight.
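The compression savings mentioned above follow from Athena billing by the amount of data a query scans. A small arithmetic sketch makes the relationship concrete. The price per terabyte and the 4:1 compression ratio are assumptions chosen for illustration, not figures from this course; check current Athena pricing before using real numbers.

```python
PRICE_PER_TB_USD = 5.00  # assumption for illustration only
BYTES_PER_TB = 1024 ** 4


def query_cost(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimated cost of one query, proportional to the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


raw_bytes = BYTES_PER_TB              # 1 TB of uncompressed data
compressed_bytes = raw_bytes // 4     # hypothetical 4:1 compression ratio

raw_cost = query_cost(raw_bytes)
compressed_cost = query_cost(compressed_bytes)
savings = 1 - compressed_cost / raw_cost  # fraction saved per query
```

Under these assumed numbers the query over compressed data costs a quarter as much, a 75% saving, which sits inside the 30%–90% range quoted on the slide.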
  • 73. AWS Lake Formation
  • 74. Challenges of building a secure data lake. Typical steps to build a secure data lake: 1. Set up storage 2. Move data 3. Cleanse, prepare, and catalog data 4. Configure and enforce security and compliance policies 5. Make data available for analytics. Roles: data engineer (ingestion and cleaning), data security officer (security), data analyst (analytics and machine learning).
  • 75. AWS Lake Formation for a secure data lake. 1. Ingest and organize: automates creating the data lake and data ingestion. 2. Secure and control: sets up fine-grained access control and data governance. 3. Collaborate and use: search and data discovery using Data Catalog metadata; to protect data, all access is checked against set policies. 4. Monitor and audit: based on data access and governance policies, alert notifications are raised on policy violation and logged.
  • 76. AWS Lake Formation benefits. • Cost-effective, durable storage includes global replication capabilities. • Simplified ingest and cleaning enables data engineers to build faster. • Centralized management of fine-grained permissions empowers security officers. • Comprehensive set of integrated tools enables every user equally. Diagram: AWS Lake Formation (blueprints, ML transforms, Data Catalog, access control) sits on Amazon S3 data lake storage and serves Amazon Redshift, Amazon Athena, AWS Glue, Amazon EMR, Amazon QuickSight, and Amazon SageMaker.
  • 77. Data visualization with Amazon QuickSight
  • 78. Amazon QuickSight. BI service built for the cloud with pay-per-session pricing and ML insights. Scalable: automatically scales with use and activity, with no additional infrastructure requirements; seamlessly grows with customers. Pay for use: pay monthly or annually; with pay-per-session pricing, customers only pay when they access their reports and dashboards, with no upfront costs. Serverless and fully managed: fully managed cloud application, meaning there's no upfront cost, software to deploy, capacity planning, maintenance, upgrades, or migrations. Fully integrated: deeply integrated with data sources and other AWS services like Amazon Redshift, Amazon S3, Athena, Amazon Aurora, Amazon RDS, IAM, AWS CloudTrail, and Amazon Cloud Directory, providing customers with everything they need for an end-to-end cloud BI solution.
  • 79. Serverless data lakes and analytics. Diagram labels: data sources (Amazon RDS, web app data, other databases, on-premises data, streaming data), Amazon S3, AWS Glue crawler, AWS Glue Data Catalog, Amazon Athena, Amazon EMR, Amazon Redshift Spectrum, Amazon QuickSight.
  • 80. Use case: COVID-19 pandemic. Challenge: The COVID-19 pandemic has stressed healthcare systems, businesses, and economies. It has disrupted the daily lives of people around the world. People need a solution to capture data (diagnosis, mortality, and recovery rates) globally in real time, and turn the data into insights they can share and respond to with confidence. Solution: Amazon worked with APN Partners Salesforce, Tableau, and MuleSoft to create a secure data lake using AWS Data Exchange, AWS Glue, Amazon Athena, and Amazon S3 as a store of trusted data from open source COVID-19 data providers. Benefits: Health workers, scientists, and decision makers can access and compare international data to their local data, enabling understanding and visualization of the impact of COVID-19 locally and globally. This solution enables decision making and deeper insights to help manage and flatten the COVID-19 curve until a vaccine is available.
  • 81. Use case: COVID-19 data lake architecture. Diagram labels: Tableau COVID-19 data platform; upload to Amazon S3; publish and update data products with AWS Data Exchange; Lambda function (data revision export to Amazon S3); AWS Glue; define Athena schema; Amazon Athena; connect to S3 data with the Amazon Athena connector in Tableau; visualization for desktop users. https://d2908q01vomqb2.cloudfront.net/77de68daecd823babbb58edb1c8e14d7106e83bb/2020/05/29/COVID-19-AWS-Tableau-
  • 82. Summary. Evolution of data architecture: traditional data warehousing, data lakes on AWS, real-time analytics with streaming data, data warehouse modernization, data governance, machine learning. Services covered: Amazon Redshift • Amazon S3 • AWS Glue • AWS Data Exchange • Amazon Athena • AWS Lake Formation • Amazon QuickSight • AWS data migration options
  • 83. Activity: Serverless Data Lake Lab Demonstration
  • 84. Activity overview. The activity consists of a video demonstration of three key steps: • Step 1: Build a serverless data lake: build a data lake with an AWS CloudFormation template; load raw New York City (NYC) taxi data into an Amazon S3 bucket; program an AWS Glue ETL job to convert raw taxi data into Parquet data storage format • Step 2: Run an Amazon Athena query: run a SQL query with Amazon Athena to query taxi data in Parquet format • Step 3: Visualize data with Amazon QuickSight: use Amazon Athena to visualize data with Amazon QuickSight. https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
  • 85. Step 1: Serverless data lake architecture. Diagram labels: raw zone (Amazon S3), AWS Lambda, AWS Glue crawler, Amazon CloudWatch, Amazon SQS, Amazon SNS (email notification), AWS Lambda, AWS Glue (ETL job), Amazon CloudWatch, processed zone (Amazon S3).
  • 86. Module 4: AWS Data Analytics Solutions – Part II
  • 87. Objectives. In this module, you will learn about three key types of data analytics technical solutions on AWS: • Streaming and real-time analytics with Amazon Kinesis • Data governance • Extended solution: Insights and monetization with machine learning (ML). Diagram: evolution of data architecture (traditional data warehousing, data lakes on AWS, real-time analytics with streaming data, data warehouse modernization, data governance, machine learning).
  • 88. Solution 3: Streaming and real-time analytics with Amazon Kinesis
  • 89. Journey to a modern data architecture. Evolution of data architecture: traditional data warehousing, data lakes on AWS, real-time analytics with streaming data, data warehouse modernization, data governance, machine learning. Diagram label: types of data used.
  • 90. Streaming data defined. Data that is generated continuously from thousands of data sources and sent simultaneously. Examples: player-game interactions, geolocation of cars and devices, music downloads, website clicks, social media streams.
  • 91. Common use cases: Real-time analytics. The value of data diminishes over time (timeline: milliseconds, seconds, minutes, hours). • Messaging between microservices • Response analytics (web and mobile application notifications) • Log ingestion • Internet of Things (IoT) device maintenance • Change data capture (CDC) • Streaming ETL into data lakes and data warehouses
  • 92. Enabling real-time analytics. Data streaming technology enables a customer to ingest, process, and analyze high volumes of high-velocity data from a variety of sources, in real time.
  • 93. Data streaming solution challenges. Challenges of building on-premises, real-time streaming solutions: • Difficult to set up • Difficult to achieve high availability • Error prone and complex to manage • Tricky to scale • Integration requires development • Expensive to maintain
  • 94. AWS streaming data solutions. Efficiently collect, process, and analyze data streams in real time: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics.
  • 95. Data generators: Simple streaming data patterns. Data producers: mobile and applications, Amazon Kinesis Agent, Amazon Kinesis Data Streams, Amazon CloudWatch Logs, Amazon CloudWatch Events, AWS IoT, Apache Kafka, Amazon Kinesis Producer Library (KPL). Streaming services: Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, Amazon Kinesis Data Streams. Data consumers: Amazon EMR, Amazon Redshift, Amazon Simple Storage Service (S3), Amazon EC2, Amazon Kinesis Connector Library.
  • 96. Amazon Kinesis Data Streams
  • 97. Amazon Kinesis Data Streams. Massively scalable, highly durable data ingestion and processing service optimized for real-time data streaming. • No upfront cost; low, pay-as-you-go pricing • Data collected is available within 70 milliseconds for real-time analytics (dashboards, anomaly detection, dynamic pricing) • Data is synchronously replicated across 3 Availability Zones in a Region • Data can be stored for up to 7 days • Serverless; can scale dynamically to handle MB to TB each hour and thousands to millions of PutRecords each second. https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=5
  • 98. How Kinesis Data Streams works. Input: capture and send data. Amazon Kinesis Data Streams: ingest and store data streams for processing. Build custom, real-time applications with Amazon Kinesis Data Analytics, Spark on Amazon EMR, Amazon EC2, or AWS Lambda. Output: analyze streaming data using BI tools.
  • 99. Kinesis Data Streams architecture. Data producers: Amazon EC2 instances, client, mobile client, traditional server. Amazon Kinesis data stream: Shard 1, Shard 2, through Shard N. Data consumers: EC2 instances, Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics. Destinations: Amazon Redshift, Amazon S3, Amazon EMR, Amazon DynamoDB. Each data record in a shard carries a sequence number, a partition key, and a data blob. https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=5
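The role of the partition key in the record structure above can be shown with a short sketch. Kinesis Data Streams hashes the partition key with MD5 to a 128-bit value and routes the record to the shard whose hash key range contains it. The code below models that idea with evenly split ranges; the function names, the shard count, and the sample keys are assumptions for illustration, and real streams can have uneven ranges after resharding.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_ranges(num_shards: int):
    """Split the hash key space evenly into inclusive (start, end) ranges."""
    size = HASH_SPACE // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == num_shards - 1 else (i + 1) * size - 1
        ranges.append((start, end))
    return ranges


def shard_for(partition_key: str, ranges) -> int:
    """Return the index of the shard whose range contains the key's MD5 hash."""
    hash_key = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside all shard ranges")


ranges = shard_ranges(3)
assignments = {key: shard_for(key, ranges) for key in ("player-1", "player-2", "car-17")}
```

Records with the same partition key always land on the same shard, which preserves their order; a key with too few distinct values can concentrate traffic on one shard.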
  • 100. Kinesis Data Streams provisioning
  • 101. Amazon Kinesis Data Firehose
  • 102. How Kinesis Data Firehose works. Input: capture and send data. Amazon Kinesis Data Firehose: prepares and loads data continuously to the selected destinations. Destinations (Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, Splunk): durably store the data for analytics. Output: analyze streaming data using analytics tools.
  • 103. Kinesis Data Streams and Kinesis Data Firehose
      Characteristic | Amazon Kinesis Data Streams | Amazon Kinesis Data Firehose
      Processing time | As fast as 70 milliseconds after ingestion | Between 60 and 900 seconds
      Stream storage and duration | In shards, default 24 hours and up to 7 days | Max buffer size 128 MB and max time 900 seconds
      Data transformation and conversion | None | Uses AWS Lambda and AWS Glue
      Data producer (listed once on the slide) | Amazon Kinesis Agent, applications using Amazon Kinesis Producer Library (KPL), AWS SDK for Java, Amazon CloudWatch Logs and CloudWatch Events, AWS IoT
      Data consumer | AWS Lambda, Amazon Kinesis Data Analytics, Amazon Kinesis Data Firehose, applications using the Kinesis Client Library (KCL) and SDK for Java | Amazon S3, Amazon Redshift, Amazon ES, Splunk, and Amazon Kinesis Data Analytics
      Data compression | None | gzip, Snappy, Zip, or no data compression
      https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=5 https://aws.amazon.com/kinesis/data-firehose/faqs/?nc=sn&loc=5
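The Firehose column above (delivery after 60 to 900 seconds, bounded by a maximum buffer size and a maximum time) describes a buffer that flushes when either limit is reached first. The toy class below sketches that rule. It is not the service's implementation: the class, its thresholds, and the in-memory `batches` list standing in for a destination are all hypothetical.

```python
class BufferedDelivery:
    """Toy buffer that flushes when a size limit or an age limit is reached."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # time the first record entered the current buffer
        self.batches = []      # stands in for the delivery destination

    def put(self, record: bytes, now: float) -> None:
        """Add a record at time `now` (seconds) and flush if a limit is hit."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        self._maybe_flush(now)

    def tick(self, now: float) -> None:
        """Called periodically so an idle buffer still flushes on age."""
        self._maybe_flush(now)

    def _maybe_flush(self, now: float) -> None:
        if not self.records:
            return
        too_big = self.size >= self.max_bytes
        too_old = now - self.opened_at >= self.max_seconds
        if too_big or too_old:
            self.batches.append(b"".join(self.records))
            self.records, self.size, self.opened_at = [], 0, None


buf = BufferedDelivery(max_bytes=10, max_seconds=60)
buf.put(b"aaaa", now=0)      # 4 bytes: below both limits
buf.put(b"bbbbbbb", now=1)   # 11 bytes total: size limit reached, flush
buf.put(b"cc", now=2)        # starts a new buffer at t=2
buf.tick(now=62)             # 60 seconds old: age limit reached, flush
```

Larger limits mean fewer, bigger objects at the destination at the price of higher latency, which is the trade-off behind the near-real-time behavior in the table.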
  • 104. When to use Kinesis Data Streams and Kinesis Data Firehose. Amazon Kinesis Data Streams: for data streaming applications with massive ingestion requirements • Requires data to be sent to consumer analytics services for millisecond response time • Massively scalable • Data retention time ranging from hours to days • Example: Real-time gaming. Amazon Kinesis Data Firehose: for data streaming applications that require near real-time responses in seconds • Need for data augmentation, data transformation, or data compression • Need to save data to Amazon S3, Amazon Redshift, Amazon ES, Splunk, or send data to Amazon Kinesis Data Analytics for analytics • Example: Log analytics
  • 105. Amazon Kinesis Data Analytics
  • 106. Amazon Kinesis Data Analytics. Input: capture streaming data with Amazon MSK, Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, or other data sources. Amazon Kinesis Data Analytics: query and analyze streaming data. Output: send processed data to analytics tools to create alerts and respond in real time.
  • 107. Kinesis data analytics application details
  • 108. Use case: Clickstream analytics. Evolve from batch processing to real-time analytics. 1. Input: websites send clickstream data. 2. Amazon Kinesis Data Firehose collects the data and sends it to Kinesis Data Analytics. 3. Amazon Kinesis Data Analytics processes data in near-real time. 4. Amazon Kinesis Data Firehose loads processed data into Amazon Redshift. 5. Amazon Redshift runs analytics models to identify content recommendations. 6. Output: readers see personalized content suggestions and increase engagement.
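The near-real-time processing step in the clickstream flow typically groups events into short time windows before anything is loaded downstream. The sketch below uses plain Python as a stand-in for a streaming analytics application and counts clicks per page in fixed, non-overlapping (tumbling) windows. The event format, the 60-second window, and the function name are assumptions for illustration.

```python
from collections import Counter, defaultdict


def tumbling_counts(events, window_seconds: int = 60):
    """Count clicks per page inside fixed, non-overlapping time windows.

    `events` is an iterable of (timestamp_in_seconds, page) tuples.
    Returns {window_start: {page: count}} ordered by window start.
    """
    windows = defaultdict(Counter)
    for timestamp, page in events:
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][page] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}


events = [(3, "/home"), (15, "/news"), (59, "/home"), (61, "/news"), (119, "/news")]
counts = tumbling_counts(events)
# counts == {0: {"/home": 2, "/news": 1}, 60: {"/news": 2}}
```

Each window's counts could then be delivered as one batch, which is how a continuous click stream turns into rows that a data warehouse can load.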
  • 109. Put it all together: Streaming data analytics with AWS
  • 110. Streaming data analytics architecture. Diagram labels (steps 1 to 5, with fan-out): millions of data sources; Kinesis Data Streams; Kinesis Data Firehose; Kinesis Data Analytics; Amazon Kinesis enabled applications; AWS Lambda; Amazon Simple Notification Service (alerts, notifications); Amazon Elasticsearch Service (logs and processed data); Amazon S3 data lake (data science, machine learning); Amazon Redshift (reporting); Amazon RDS and DynamoDB (downstream applications).
  • 111. Solution 4: Data governance
  • 112. Journey to a modern data architecture. Evolution of data architecture: traditional data warehousing, data lakes on AWS, real-time analytics with streaming data, data warehouse modernization, data governance, machine learning. Diagram label: types of data used.
  • 113. Challenges of data in data lakes • Securing data • Auditing data usage • Managing data access • Safeguarding sensitive data and PII • Maintaining regulations and mandates
  • 114. Data security and governance. With big data comes big responsibility. More than one in three companies cite data privacy and governance as a hurdle to both digital transformation and IoT initiatives. Two figures, 34% and 37%, appear in slide order alongside these statements: IT decision makers who cite ensuring data governance/privacy as one of their organization's biggest digital transformation challenges; IT decision makers who cite ensuring security/compliance upon movement of data as one of their most important IoT priorities over the next 18–24 months. Source: © Enterprise Strategy Group, 2019. https://www.esg-global.com/hubfs/ESG-Infographic-IT-Spending-Intentions-
  • 115. Resolving PII dangers. Dangers to personally identifiable information (PII): consumer consent violation, data breach, spyware, unsecured devices, rogue agents, second-party misuse, espionage, external hacking. • Do these issues need to be resolved? • Is there a solution architecture that solves all PII issues? • What best practices can be used to mitigate PII dangers?
  • 116. Amazon Macie. Enable: turn on Amazon Macie with one click in the AWS Management Console or with a single API call. Continually evaluate the Amazon S3 environment: automatically generates an inventory of Amazon S3 buckets and details on bucket-level security and access controls. Discover sensitive data: analyzes buckets using ML and pattern matching to discover sensitive data, like PII. Take action: generates findings and sends them to Amazon CloudWatch Events for integration into workflows and remediation actions. Sensitive data types: • Financial • Personal • National • Medical • Credentials and secrets
  • 117. De-identified data lake (DIDL) on AWS. A de-identified data lake (DIDL) is an architectural approach that reduces the risks associated with managing data, particularly personally identifiable information (PII). Benefits: Reduce risk • Remove PII before it enters a data lake. Understand all the data • Create a Data Catalog of an entire data lake. Reduce compliance costs • Automate the discovery, classification, de-identification, and ongoing monitoring of data across an organization. Turn data into an asset, not a liability • Enable a broader set of governed analytic and machine learning use cases
  • 118. Masking PII data. Masked fields: email, customer ID, and the name, SSN, and state inside the transcript.
      Before masking:
      Email | Customer ID | Transcript
      csalazar@example.com | 19664 | Just talked to Carlos Salazar
      mary@example.com | 23423 | Mary’s SSN is 000000000
      mateo@example.com | 99644 | Mateo is moving to Nevada
      NA | 02945 | It is expected to rain tomorrow
      After masking:
      Email | Customer ID | Transcript
      4t34gttt | 7462391 | Just talked to Jane Roe
      44e5325 | 1239474 | Jorge’s SSN is 666666666
      0we&yrw | 9983487 | Sofia is moving to Texas
      NA | 3344325 | It is expected to rain tomorrow
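One simple way to approximate the masking shown above is to replace identifier fields with keyed, deterministic tokens and to redact pattern-matched values inside free text. The sketch below does that for the email, customer ID, and SSN-like digit runs. It is an assumption-laden illustration, not the method used in the course: the secret key, field names, token length, and regular expression are hypothetical, and replacing names or states in free text (as the slide does) would need entity recognition that is not shown here.

```python
import hashlib
import hmac
import re

SECRET_KEY = b"replace-me"  # hypothetical; a real key belongs in a secrets manager

# Nine digits, optionally written as 3-2-4 with hyphens.
SSN_PATTERN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")


def tokenize(value: str, length: int = 8) -> str:
    """Keyed, deterministic pseudonym: the same input always gives the same token."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def mask_record(record: dict) -> dict:
    """Return a copy of the record with identifiers tokenized and SSNs redacted."""
    masked = dict(record)
    if record["email"] != "NA":
        masked["email"] = tokenize(record["email"])
    masked["customer_id"] = tokenize(record["customer_id"])
    masked["transcript"] = SSN_PATTERN.sub("#########", record["transcript"])
    return masked


record = {
    "email": "mary@example.com",
    "customer_id": "23423",
    "transcript": "Mary's SSN is 000000000",
}
masked = mask_record(record)
```

Deterministic tokens keep joins possible across masked datasets (the same customer maps to the same token) while the original values stay out of the data lake.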
  • 119. Extended solution 5: Insights and monetization with ML on AWS
  • 120. Journey to a modern data architecture. Evolution of data architecture: traditional data warehousing, data lakes on AWS, real-time analytics with streaming data, data warehouse modernization, data governance, machine learning. Diagram label: types of data used.
  • 121. Data lakes and machine learning. Machine learning requires: • More data: Collect all types of data • Flexibility: Define schema during analysis • Scalability: Scale storage and compute (CPU or GPU) independently • Data transformation and processing: Run a broad set of processing and analytics on the same data without movement • Security: Networking, identity, encryption, and compliance. Diagram labels: sources (OLTP, ERP, CRM, LOB, devices, web, sensors, social); data warehouse; business analytics; data lake; Data Catalog; AI and machine learning; data warehouse queries; big data processing; interactive; real time.
  • 122. Amazon SageMaker. Machine learning at enterprise scale. Build (notebooks for common problems, high-performance algorithms): • Managed Jupyter for enterprise data science • Sample notebooks for most common use cases • Single-pass, streaming training algorithms. Train and tune (one-click training, hyperparameter optimization): • Training models at scale without DevOps assistance • ML on ML to optimize hyperparameters. Deploy and manage (one-click deployment, fully managed elastic hosting): • Deploy to production with a single call • Fully managed, production-grade inferences. https://aws.amazon.com/machine-learning/?nc2=h_ql_prod_ml
  • 123. Machine learning resources
      • AWS Foundations: How Amazon SageMaker Can Help
          • Fundamental digital course on how SageMaker mitigates the core challenges of implementing an ML pipeline
          • Duration: 30 minutes
          • https://www.aws.training/Details/Video?id=49646
      • The Machine Learning Pipeline on AWS
          • Explore how to use the machine learning pipeline to solve a real business problem (intermediate)
          • Duration: 4 days
          • https://www.aws.training/SessionSearch?pageNumber=1&courseId=38910
      • Practical Data Science with Amazon SageMaker
          • Learn to solve real-world use cases with machine learning (intermediate)
          • Duration: 1 day
          • https://www.aws.training/SessionSearch?pageNumber=1&courseId=40748
      • AWS STP: Machine Learning (ML) on AWS for ML Practitioners - Technical
          • https://partnercentral.awspartner.com/LmsSsoRedirect?RelayState=%2flearningobject%2fcurriculum%3fid%3d25521
  • 124. Summary
      Evolution of data architecture:
      • Traditional data warehousing
      • Data lakes on AWS
      • Real-time analytics with streaming data: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics
      • Data warehouse modernization
      • Data governance: Amazon Macie
      • Machine learning: Amazon SageMaker
  • 125. Module 5: AWS Technical Conversations and Engagement
  • 126. Technical engagement conversations using the Data Flywheel
  • 127. The Data Flywheel
      1. Move and store data in the cloud
      2. Move and manage all workloads in the cloud
      3. Build data-driven applications
      4. Analyze with data lake architectures
      5. Innovate with machine learning
  • 128. Conversations using the Data Flywheel
      1. Move and store data in the cloud
      2. Move and manage all workloads in the cloud
      3. Build data-driven apps
      4. Analyze with data lake architectures
      5. Innovate with machine learning
  • 129. AWS six-phase strategy for implementing a data analytics solution
  • 130. Data analytics projects: A phased strategy
      • Phase 1: Data analytics in the cloud assessment
      • Phase 2: Use case identification
      • Phase 3: Architecture and data migration
      • Phase 4: POC delivery
      • Phase 5: Application tuning and optimization
      • Phase 6: Migration from POC to production
  • 131. Phase 1: Data analytics in the cloud assessment
  • 132. Phase 2: Use case identification
  • 133. Phase 3: Architecture and data migration
  • 134. Architecture and data migration (Phase 3): APN Partner best practices
      • Avoid: Engaging AWS Support too late in the process
      • Do: Engage AWS
          • AWS Partner Development Managers
          • Partner Solutions Architects
          • AWS Professional Services
  • 135. Phase 4: Proof of concept delivery
  • 136. Phase 5: Application tuning and optimization
  • 137. Phase 6: Migration from POC to production
  • 138. Phase 6: POC to production best practices
      Do:
      • Identify groups and roles in the organization that requested the POC
      • Create a thought-out plan
      • Set up a continuous integration and continuous delivery (CI/CD) pipeline
      • Set up metrics and alarms for production environment
      • Continue engagement with the customer
  • 139. AWS well-architected review using the Analytics Lens
  • 140. 10 design principles: Analytics applications, 1–5
      1. Automate data ingestion to handle big data
      2. Design ingestion for failures and duplicates
      3. Preserve original source data
      4. Describe data with metadata
      5. Establish data lineage
      https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Analytics-Lens.pdf
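Principles 2 and 3 above (design ingestion for failures and duplicates; preserve original source data) can be sketched as follows. This is only an illustration under stated assumptions, not the approach described in the Analytics Lens: the `IdempotentIngestor` class is hypothetical, an in-memory dict stands in for durable storage, and a SHA-256 content hash is assumed to be an acceptable deduplication key.

```python
import hashlib
import json


class IdempotentIngestor:
    """Hypothetical ingestion step that tolerates retries and bad input."""

    def __init__(self):
        self.raw_store = {}  # content hash -> original payload, never modified
        self.rejected = []   # unparseable payloads, kept so they can be replayed

    @staticmethod
    def record_key(payload: bytes) -> str:
        # Identical payloads always map to the same key, so a retried
        # delivery is detected instead of being stored twice.
        return hashlib.sha256(payload).hexdigest()

    def ingest(self, payload: bytes) -> str:
        key = self.record_key(payload)
        if key in self.raw_store:
            return "duplicate"
        try:
            json.loads(payload)  # validate only; the raw bytes are what we keep
        except ValueError:
            self.rejected.append(payload)
            return "rejected"
        self.raw_store[key] = payload
        return "stored"


ingestor = IdempotentIngestor()
first = ingestor.ingest(b'{"order_id": 1}')
retry = ingestor.ingest(b'{"order_id": 1}')  # same payload delivered again
bad = ingestor.ingest(b"not json")
```

Keeping the untouched payload, rather than a parsed or cleaned version, means later transformations can be rerun from source if their logic changes.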
  • 141. 10 design principles: Analytics applications, 6–10
      6. Use the right ETL tool for the job
      7. Orchestrate ETL workflows
      8. Tier storage appropriately
      9. Secure, protect, and manage the entire analytics pipeline
      10. Design for scalable and reliable analytics pipelines
      https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Analytics-Lens.pdf
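Principle 7 above (orchestrate ETL workflows) comes down to running steps in dependency order and passing results along. A minimal sketch, assuming Python 3.9 or later for the standard-library `graphlib` module; the `run_workflow` helper and the three example steps are hypothetical, and a managed orchestration service would normally add retries, scheduling, and monitoring on top of this idea.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, dependencies):
    """Run ETL steps so that every step executes after its prerequisites.

    steps:        step name -> callable that receives the results so far
    dependencies: step name -> set of step names that must run first
    """
    order = list(TopologicalSorter(dependencies).static_order())
    results = {}
    for name in order:
        results[name] = steps[name](results)
    return order, results


# Hypothetical three-step pipeline, for illustration only.
steps = {
    "extract": lambda done: [" Alice ", "BOB"],
    "transform": lambda done: [name.strip().lower() for name in done["extract"]],
    "load": lambda done: len(done["transform"]),
}
dependencies = {"transform": {"extract"}, "load": {"transform"}}

order, results = run_workflow(steps, dependencies)
```

Declaring dependencies explicitly, instead of hard-coding the call order, lets new steps be added without rewriting the runner, and `TopologicalSorter` raises `CycleError` if the declared dependencies are circular.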
  • 142. Module 6: APN Partner Opportunities and Resources
  • 143. Objectives
      In this module, you will learn how to:
      • Describe how to collaborate with AWS for data analytics
      • Describe AWS Data and Analytics resources for APN Partners:
          • Competency categories
          • AWS Immersion Days
          • AWS Certified Data Analytics and learning resources
      • Access the AWS Marketplace
      • Perform the calls to action
  • 144. APN Partners and AWS for Data Analytics
  • 145. Discounting and funding programs
      • Migration programs
      • POC funding
  • 146. AWS Data and Analytics Competency categories
      • Data Analytics Platforms: Provide a set of integrated tools to solve data analytics challenges within a standard framework
      • NoSQL/New SQL: Provide highly scalable databases that organize data into a structure
      • Data Integration and Preparation: Enable customers to move and consolidate data from disparate sources, transform it, and prepare it for analytics
      • Business Intelligence (BI) and Data Visualization: Help customers turn raw data into actionable business information, such as reporting, dashboards, and data visualization
      • Data Governance and Security: Help customers discover, categorize, and control their data
  • 147. Best practices after identifying an opportunity
      • Use existing Partner programs
      • Cultivate strong relationships with AWS sales teams
      • Register your opportunity through APN Partner Central
      • Achieve AWS Data and Analytics competency
  • 148. Collaboration workflow
      [Diagram labels: Build a reference solution; Conduct a big data POC; Validate the POC; Build and deliver the live solution; Receive approval from AWS PSM; Engage AWS sales; Engage AWS account or Partner SA; Register an opportunity on APN Partner Central; Before SA involvement; Direct SA involvement]
  • 149. AWS Professional Services
      • Global team of experts
      • Collaborate with APN Partners to help customers realize their desired business outcomes in AWS Cloud
      • Reach out to APN Partners when they need additional resources
      AWS Professional Services: https://aws.amazon.com/professional-services/
  • 150. AWS data analytics solutions and Immersion Days
  • 151. AWS Data Lab program
      • The AWS Data Lab program offers accelerated joint engineering engagements between a team of customer builders and AWS technical resources to create tangible deliverables that accelerate data and analytics modernization initiatives.
      • Two offerings:
          • Design Lab: Focus on real-world architectural design
          • Build Lab: Focus on providing guidance with building a functioning prototype with a customer team
      • Duration: Half day to 5 days
      • Location: Virtual or AWS Data Lab hub (Seattle, NYC, Herndon (VA), London, Bangalore)
      • Cost: Free. Reach out to your APN support team for more information.
      https://aws.amazon.com/aws-data-lab/
  • 152. AWS Immersion Days
      Designed to help APN Advanced and Premier Consulting Partners deliver technical data analytics workshops to their customers and help grow their businesses
      • Data Engineering Immersion Day: Build a serverless data lake solution on AWS including modules focusing on ingestion, hydration, exploration, and consumption
      • Amazon EMR Immersion Day: Focus on unique facets of Amazon EMR for big data workloads
      • Database Migration Immersion Day: Give your customers a head start with the AWS Database Migration Service and the Schema Conversion Tool
      • … and many more.
      Benefits: Access to technical workshop content, AWS usage credits, Market Development Funds (MDF) opportunities, and support from AWS teams
      https://aws.amazon.com/partners/immersion-days/
  • 153. AWS Certified data analytics and learning resources
  • 154. AWS Technical Professional Learning Path
  • 155. AWS Certified Data Analytics – Specialty https://aws.amazon.com/certification/certified-data-analytics-specialty/
  • 156. Partner Cast: Analytics
  • 157. Call to action
  • 158. Build a data analytics practice on AWS
      • Build packaged solutions
      • Know your Partner Solutions Architect
      • Ask for customer references
      • Engage with AWS service teams
      • Develop customer workshops
      • Achieve an APN competency
  • 159. Call to action
      • Use the Data Flywheel to perform assessments
      • Work with your Partner team to schedule an Immersion Day for your customers
      • View the analytics customer case studies: https://aws.amazon.com/big-data/datalakes-and-analytics/
      • Create a specialized service around one of the analytics services
      • Participate in the AWS Data Lab: https://aws.amazon.com/aws-data-lab/
      • Prepare for the AWS Data Analytics – Specialty certification
      • Build relationships with APN teams for funding opportunities for your marketing and sales efforts
  • 160. © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. For corrections or feedback on the course, please email us at aws-course-feedback@amazon.com. For all other questions, contact us at https://aws.amazon.com/contact-us/aws-training/. All trademarks are the property of their owners.
      Thank you! Parvesh Chopra: choprapa@amazon.com