3. Course objectives
In this course, you will learn how to:
• Identify Amazon Web Services (AWS) services in the AWS analytics stack
• Describe decision points and technology selections for data analytics architectures
• Design highly available and fault-tolerant serverless data analytics architectures
• Discuss the AWS Data Pipeline and the customer data analytics journey using the Data
Flywheel
• Describe five AWS data analytics technical solutions:
• Modernizing a data warehouse with Amazon Redshift
• Data lakes
• Streaming data
• Data governance
• Machine learning (ML)
• Identify technical engagement strategies and best practices for delivering a proof of
concept (POC)
• Locate and use AWS Partner Network (APN) Partner resources for opportunities and training
© 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved.
4. About this course
• This course is for technical professionals at APN Consulting Partner
organizations who are engaged in pre-sales discussions with customers to
help architect data analytics solutions on AWS and answer technical questions
about using AWS data analytics services.
• This 1-day course gives technical professionals sufficient technical
knowledge of AWS data analytics services and solutions to engage with and
help customers successfully.
• This course is not designed to be a technical deep dive into AWS data
analytics services and solutions. It provides the necessary resources and a
learning path toward gaining deeper knowledge of the services.
5. Agenda
Module 1: Course Introduction
Module 2: AWS Data Analytics Stack Portfolio
Break
Module 3: AWS Data Analytics Solutions – Part I
- Data lake solution
Break
Module 4: AWS Data Analytics Solutions – Part II
Break
Module 5: Technical Engagement Strategies
Module 6: APN Partner Opportunities and Resources
7. Objectives
In this module, you will learn how to:
• Understand customer challenges related to data analytics in their business
• Provide a technical overview of AWS data analytics portfolio
• Discuss technical advantages and position of data analytics solutions on
AWS
• Explain how to build a data analytics pipeline
• Explain the Data Flywheel
9. New realities
By making 10% more data accessible, a typical Fortune 1000
company will see a $65 million increase in net income.*
• Explosion of data: connected devices, apps, and systems generate more data than ever before.
• Pay-as-you-go pricing allows organizations to analyze data to gain insights.
• Demand is growing for faster decision making on real-time data.
*Source: Forbes Online; New Vantage Partners - Big Data Executive Survey
https://www.forbes.com/sites/cognitiveworld/2019/02/06/data-the-fuel-powering-ai-digital-transformation/#5062b36b578b
10. Customers need your help
85% of businesses want to be data driven,
but only 37% have been successful.
https://www.forbes.com/sites/cognitiveworld/2019/02/06/data-the-fuel-powering-ai-digital-transformation/#51efb027578b
http://newvantage.com/wp-content/uploads/2017/01/Big-Data-Executive-Survey-2017-Executive-Summary.pdf
11. Common data analytics challenges
Top four challenges
involve knowledge, skill,
security, and privacy
This is your opportunity
Data security (unauthorized access to company
data)
Data privacy issues (safety of personal data)
What challenges do you see when using big data
analytics/technologies? (n=545)
Inadequate technical know-how in our company
53%
49%
48%
48%
Inadequate analytical know-how in our company
https://bi-survey.com/challenges-big-data-analytics
12. AWS data analytics portfolio overview
13. Secure infrastructure for analytics
Customers need multiple levels of security, identity and access
management, encryption, and compliance to secure their data lake.
Compliance
• AWS Artifact
• Amazon Inspector
• AWS CloudHSM
• Amazon Cognito
• AWS CloudTrail
Security
• Amazon GuardDuty
• AWS Shield
• AWS Well-Architected Tool
• Amazon Macie
• Amazon Virtual Private Cloud (Amazon VPC)
Encryption
• AWS Certificate Manager Private Certificate Authority (ACM Private CA)
• AWS Key Management Service (AWS KMS)
• Encryption at rest
• Encryption in transit
• Bring your own keys, hardware security module (HSM) support
Identity
• AWS Identity and Access Management (IAM)
• AWS Single Sign-On
• Amazon Cloud Directory
• AWS Directory Service
• AWS Organizations
14. AWS data analytics portfolio
Data visualization, engagement, and machine learning
• Amazon QuickSight
• Amazon SageMaker
• Amazon Comprehend
• Amazon Lex
• Amazon Polly
• Amazon Rekognition
• Amazon Translate
• Amazon Pinpoint
• AWS Data Exchange
Analytics
• Amazon Redshift
• Amazon EMR (Spark and Presto)
• Amazon Athena
• Amazon Elasticsearch Service
• Amazon Kinesis Data Analytics
• AWS Glue (Spark and Python)
Data lake infrastructure and management
• Amazon Simple Storage Service (Amazon S3) & Amazon S3 Glacier
• AWS Glue
• AWS Lake Formation
Data movement
• AWS Database Migration Service (AWS DMS)
• AWS Snowball
• AWS Snowmobile
• Amazon Kinesis Data Firehose
• Amazon Kinesis Data Streams
• Amazon Managed Streaming for Apache Kafka
15. Data movement services
Help customers move data from on premises to the cloud
• AWS DMS
• AWS Snowball
• AWS Snowmobile
• Amazon Managed Streaming for Apache Kafka
• Amazon Kinesis Data Streams
• Amazon Kinesis Data Firehose
16. Data lake services
Customers are constrained by volume, variety, veracity, and
velocity of on-premises data, and data silos pose a major challenge.
• Amazon S3
• Amazon S3 Glacier
• AWS Lake Formation
• AWS Glue
17. Analytics services
Help customers extract value out of their data
• Amazon Redshift
• Amazon EMR
• AWS Glue
• Amazon ES
• Amazon Athena
• Amazon Kinesis Data Analytics
18. Data visualization, engagement, and machine learning services
Help customers understand and visualize their data, and use
machine learning (ML) for advanced analytics and predictions
• Amazon QuickSight
• Amazon SageMaker
• AWS Data Exchange
20. Standards, formats, and open source
• Apache Flink
• Ganglia
• Apache HBase
• HCatalog
• Hadoop Distributed File System (HDFS)
• Apache Hive
• Hudi
• Java
• JupyterHub
• Apache Kafka
• Apache Livy
• Apache Mahout
• MapReduce
• Apache MXNet
• MySQL
• Apache Oozie
• Apache ORC
• Apache Parquet
• Phoenix
• Apache Pig
• Presto
• Python
• PyTorch
• R
• Scala
• Apache Spark
• Sqoop
• SQL
• TensorFlow
• Tez
• Yarn
• Apache Zeppelin
• Apache Zookeeper
…and many more
21. AWS alternatives to open source
• Amazon EMR (Hadoop, Spark): Spark, Hive, Presto, Flink, HBase
• Amazon ES (operational analytics): Elasticsearch, Logstash, Kibana
• Managed Streaming for Apache Kafka (real-time analytics): Kafka
23. Data management challenges
How can customers:
• Collect a variety of data types accumulating at varying velocities?
• Collect data from numerous sources accumulating at differing velocities?
• Store massive amounts of data without running out of space?
• Cleanse data and improve its quality before it is analyzed?
Can they automate these steps?
24. Data analytics pipeline
Data → Collect → Store → Process and analyze → Visualize → Insights
Time-to-answer (latency)
Balance of throughput and cost
https://d1.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf?did=wp_card&trk=wp_card
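The four pipeline stages can be illustrated with a minimal sketch. Everything here is hypothetical: the sensor events, the in-memory list standing in for storage, and the function names are assumptions made for illustration, not AWS APIs.

```python
# Minimal sketch of collect -> store -> process and analyze -> visualize.
# All data and names below are illustrative assumptions.

def collect():
    # Stand-in for a streaming or batch source.
    return [
        {"device": "sensor-a", "reading": "21.5"},
        {"device": "sensor-b", "reading": "19.0"},
        {"device": "sensor-a", "reading": "n/a"},   # malformed record
        {"device": "sensor-a", "reading": "22.5"},
    ]

def store(events, storage):
    # Stand-in for durable storage: keep the raw events unchanged.
    storage.extend(events)
    return storage

def process(storage):
    # Cleanse (skip unparseable readings), then average per device.
    totals, counts = {}, {}
    for event in storage:
        try:
            value = float(event["reading"])
        except ValueError:
            continue
        device = event["device"]
        totals[device] = totals.get(device, 0.0) + value
        counts[device] = counts.get(device, 0) + 1
    return {device: totals[device] / counts[device] for device in totals}

def visualize(averages):
    # Stand-in for a dashboard: one text line per device.
    return [f"{device}: {avg:.1f}" for device, avg in sorted(averages.items())]

report = visualize(process(store(collect(), [])))
print(report)
```

Keeping the raw events in storage and cleansing only in the processing stage mirrors the idea of retaining raw data while serving curated results.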
25. Data pipeline challenges
Building a data pipeline is challenging. Customers must:
• Manage updates, patches, and software integrations
• Handle increased overhead costs plus need for support
• Maintain focus on the core task of building applications that lead to data
insights
26. AWS data analytics pipeline services
Collect | Store | Process and analyze | Visualize | Automate
• Amazon Kinesis Data Firehose
• AWS Direct Connect
• Amazon Kinesis Data Streams
• AWS Snowball
• Amazon S3 Glacier
• Amazon S3
• Amazon DynamoDB
• Amazon RDS
• Amazon Aurora
• Amazon CloudSearch
• Amazon ES
• Amazon EMR
• Amazon Kinesis Data Analytics
• Amazon QuickSight
• Amazon Redshift
• Amazon Athena
• AWS Database Migration Service
• Amazon SageMaker
• AWS Glue
• Amazon Managed Streaming for Apache Kafka
28. Data Flywheel and customer journey
Build data-driven applications
Modernize data warehouse and build a data lake
Migrate data and workloads to the cloud
Save time
Save costs
Store and manage data
Agility
Global distribution
Scale and performance
New and faster insights
Broader access to analytics
Innovate with machine learning
Better experiences
Deeper engagement
Efficient processes
Attract new customers
Generate more data
Data
https://pages.awscloud.com/data-flywheel.html
29. Summary
In this module, you learned about:
• Customer challenges related to data analytics
• AWS data analytics portfolio
• Technical benefits of AWS data analytics solutions
• Data analytics pipeline
• Data Flywheel
31. Objectives
In this module, you will learn how to:
• Explain data migration options from on premises to the AWS Cloud
• Describe two AWS data analytics technical solutions
• Modernizing a data warehouse with Amazon Redshift
• Data lakes
Evolution of data architecture
Traditional data warehousing
Data lakes on AWS
Real-time analytics with streaming data
Data warehouse modernization
Data governance
Machine learning
33. Journey to a modern data architecture
Evolution of data architecture
Types of data
Traditional data warehousing
Data lakes on AWS
Data warehouse modernization
Data governance
Machine learning
Real-time analytics with streaming data
34. AWS data migration options
• AWS Snowball
• Snowball Edge storage optimized
• AWS Snowmobile
• AWS Storage Gateway
• File gateway
• Tape gateway
• Volume gateway
• Amazon S3 Transfer Acceleration
• AWS Direct Connect
• AWS Database Migration Service
• Amazon Kinesis Data Firehose
35. Solution 1: Modernizing a data warehouse with Amazon Redshift
36. Journey to a modern data architecture
Evolution of data architecture
Types of data
Traditional data warehousing
Data lakes on AWS
Data warehouse modernization
Data governance
Machine learning
Real-time analytics with streaming data
38. Traditional architecture and on-premises data warehouse challenges
• Difficult to scale
• Long lead times for hardware procurement
• Complex upgrades are the norm
• High overhead costs for administration
• Expensive licensing and support costs
• Proprietary formats do not support newer open data formats, which results in data silos
• Data not cataloged, unreliable quality
• Licensing cost limits number of users and how much data can be accommodated
• Difficult to integrate with services and tools
40. Amazon Redshift
A fully managed data warehouse that is highly integrated
with other AWS services. Features include:
• Optimized for high performance
• Support for open file formats
• Petabyte-scale capability
• Support for complex queries and analytics, with data
visualization tools
• Secure end-to-end encryption and certified compliance
• Service Level Agreement (SLA) of 99.9 percent
• Based on open source Postgres database
• Cost efficient
https://aws.amazon.com/redshift/pricing/
Secure data warehouse that extends seamlessly to a data lake
41. Amazon Redshift performance features
Massively parallel processing (MPP)
• Breaks a large job into smaller tasks, then distributes the tasks to multiple compute nodes
• Result: Faster processing time
Columnar storage
• Data from each column is stored together so the data can be accessed faster, without scanning and sorting all other columns
• Result: Compression of stored data improves performance
Shared-nothing architecture
• Independent and resilient nodes without any dependencies
• Result: Improves scalability
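To make the columnar storage point concrete, here is a minimal sketch in plain Python. It is not how Amazon Redshift is implemented; the sample rows, the helper names, and the use of run-length encoding as the compression scheme are all assumptions chosen for illustration.

```python
# Sketch: why storing each column together helps scans and compression.
# Sample data and helper names are illustrative assumptions.

rows = [
    {"order_id": 1, "region": "EU", "amount": 120.0},
    {"order_id": 2, "region": "US", "amount": 80.0},
    {"order_id": 3, "region": "EU", "amount": 50.0},
    {"order_id": 4, "region": "US", "amount": 200.0},
]

def to_columns(row_list):
    # Pivot row-oriented records into one list per column.
    columns = {}
    for row in row_list:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def run_length_encode(values):
    # Collapse runs of repeated values into [value, count] pairs.
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

columns = to_columns(rows)

# An aggregate over one column reads only that column's values.
total_amount = sum(columns["amount"])

# Values of a single column are similar, so sorted runs compress well.
region_rle = run_length_encode(sorted(columns["region"]))

print(total_amount, region_rle)
```

In a row-oriented layout the same sum would have to read every field of every row, which is the cost the slide says columnar storage avoids.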
42. Amazon Redshift architecture
Client applications connect through Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC)
Data warehouse cluster
Leader node
Compute Node 1 (node slices) | Compute Node 2 (node slices)
https://docs.aws.amazon.com/redshift/index.html
43. Leader node
Responsible for communication with the client application
and compute nodes
Amazon Redshift leader
node:
• SQL endpoint
• Metadata
• Query compilation and
optimization
• Coordinates parallel
SQL processing
• Machine learning (ML)
optimizations
Leader node
Compute node 1 Compute node 2
Data warehouse cluster
Node slices Node slices
44. Compute node
• SQL running powerhouses
• Compute node can load, unload, backup,
and restore data to and from Amazon S3.
• Clusters range from 1 to 128 compute nodes.
Runs queries in parallel and returns the result to the leader node
Leader node
Compute node 1 Compute node 2
Data warehouse cluster
Node slices Node slices
45. Compute node slices
Slices are a symmetric multiprocessing (SMP) mechanism.
Slice 1 | Slice 2
Local
disk
Local
disk
Virtual
core
Virtual
core
7.5 GB
RAM
7.5 GB
RAM
• Partitioned into slices.
• Slices work in parallel to
complete operations.
• Virtual processors contained
in each compute node.
• Each slice is allocated an
equal amount of memory,
compute allowance, and disk
space.
• Each slice operates in
parallel but can request data
from other slices.
Compute node 1 Compute node 2
Data warehouse cluster
Node slices Node slices
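The way slices split work and then combine partial results can be sketched as follows. This is an illustration only: the modulo distribution on a hypothetical `customer_id` key, the slice count, and the thread pool standing in for parallel slices are assumptions, not the actual Amazon Redshift distribution logic.

```python
# Sketch: distribute rows across slices, aggregate each slice in parallel,
# then combine the partial results (the role the leader node plays).
# Key name, slice count, and data are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor

NUM_SLICES = 4

def distribute(rows, key, num_slices=NUM_SLICES):
    # Assign each row to a slice using the distribution key.
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        slices[row[key] % num_slices].append(row)
    return slices

def partial_sum(slice_rows):
    # Work done independently inside one slice.
    return sum(row["amount"] for row in slice_rows)

def parallel_total(rows):
    slices = distribute(rows, "customer_id")
    with ThreadPoolExecutor(max_workers=NUM_SLICES) as pool:
        partials = list(pool.map(partial_sum, slices))
    # Final aggregation of the per-slice results.
    return sum(partials)

sales = [{"customer_id": i, "amount": 10 * i} for i in range(1, 9)]
total = parallel_total(sales)
print(total)
```

An even spread of rows across slices matters because the slowest slice determines when the combined result is ready.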
46. Amazon Redshift instance types
https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html
47. Management interfaces
https://us-west-2.console.aws.amazon.com/redshiftv2/home?region=us-west-2#query-editor
50. Federated query
Data warehouse | Amazon Aurora | OLTP | ERP | CRM | LOB
Integrate queries on live data in Amazon RDS for PostgreSQL and Amazon
Aurora PostgreSQL with queries on Amazon Redshift and the data lake on
Amazon S3.
Reduce data moved over the network with Amazon Redshift's intelligent
optimizer, which pushes and distributes portions of the computation
directly into remote operational databases.
Benefits
• Incorporate live data into business
intelligence (BI) and reporting applications
• Ingest data into Amazon Redshift
• Query operational databases directly
• Apply transformations on the fly
• Load data into target tables without
complex ETL pipelines
51. Amazon Redshift lake house architecture
With Amazon Redshift lake house
architecture, customers can:
• Query data in the data lake and
write data back in open formats
• Use familiar SQL statements to
combine and process data across
data stores
• Run queries on live data in
operational databases without
requiring data loading and ETL
pipelines
Amazon Redshift lake house queries are run by a fleet of nodes that
are owned and maintained by AWS.
https://aws.amazon.com/redshift/lake-house-architecture/
52. Amazon Redshift lake house
Components: SQL clients and business intelligence tools (JDBC/ODBC), leader node, compute nodes 1 and 2 (node slices), Amazon Redshift lake house fleet, Amazon S3, AWS Glue Data Catalog
1 Query: SELECT COUNT(*) FROM S3.EXT_TABLE GROUP BY…
2 Query is optimized and compiled using ML at the leader node, which determines what is run locally and what goes to Amazon Redshift lake house.
3 Query plan is sent to all compute nodes.
4 Compute nodes obtain partition information from the Data Catalog and dynamically prune partitions.
5 Each compute node issues multiple requests to Amazon Redshift lake house layers.
6 Amazon Redshift lake house nodes scan Amazon S3 data.
7 Amazon Redshift lake house projects, filters, joins, and aggregates.
8 Final aggregations and joins with local Amazon Redshift tables are done in-cluster.
9 Result is sent to client.
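The partition-pruning step in the query flow above can be sketched in a few lines. The catalog contents, bucket name, and helper names are hypothetical; the point is only that partitions whose key values cannot match the filter are never scanned.

```python
# Sketch of dynamic partition pruning against catalog metadata.
# The catalog dictionary and S3 locations are illustrative assumptions.

catalog_partitions = {
    "year=2019/month=12": "s3://example-bucket/sales/year=2019/month=12/",
    "year=2020/month=01": "s3://example-bucket/sales/year=2020/month=01/",
    "year=2020/month=02": "s3://example-bucket/sales/year=2020/month=02/",
}

def parse_partition(partition_key):
    # "year=2020/month=01" -> {"year": "2020", "month": "01"}
    return dict(part.split("=") for part in partition_key.split("/"))

def prune(partitions, predicate):
    # Keep only locations whose partition values satisfy the filter.
    return [
        location
        for key, location in partitions.items()
        if predicate(parse_partition(key))
    ]

# Equivalent of a filter such as WHERE year = '2020'.
to_scan = prune(catalog_partitions, lambda p: p["year"] == "2020")
print(to_scan)
```

Only two of the three locations remain to be scanned, which is where the reduction in data read from Amazon S3 comes from.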
53. Advanced Query Accelerator (AQUA)
A new distributed and hardware-accelerated cache that makes Amazon Redshift
faster than other cloud data warehouses, without increasing cost
• Minimizes data movement over the network by pushing operations to AQUA nodes
• AQUA nodes use custom AWS-designed analytics processors to make operations
(compression, encryption, filtering, and aggregations) faster than on
traditional CPUs
Diagram: RA3 clusters, AQUA nodes (each with a custom AWS-designed processor) running in parallel, and Amazon Redshift managed storage
54. Migration to Amazon Redshift
55. Migration pattern
Migration from a legacy OLAP system
Workload Qualification Framework (WQF) uses the AWS Schema Conversion Tool (AWS SCT) to
generate reports, such as:
• Workload assessment based on complexity, size of migration effort, and technologies
• Recommendations on migration strategies
• Step-by-step instructions for migration
• Assessment of migration effort based on team size and member roles
56. AWS SCT data extractors
Extract data from your data warehouse and migrate to Amazon Redshift
• Extracts data through local migration agents
• Data is optimized for Amazon Redshift and saved in local files
• Files are loaded to an Amazon S3 bucket (through network or AWS Snowball Edge)
and then to Amazon Redshift
Source DW (Netezza, Microsoft SQL Server) → AWS SCT → Amazon S3 bucket → Amazon Redshift
57. Equinox sees faster reports, 80% cost savings
Challenge
Their data warehouse had limited integration, was very expensive,
and required a lot of platform-specific domain knowledge. They
needed to reduce administration and costs, blend structured and
semi-structured data for analytics, and evolve into a data lake
strategy.
Solution
Equinox migrated from a legacy data warehouse to Amazon Redshift to
combine data from disparate sources like clickstream data, cycling log
data, club management software, and more. They land data directly
in an Amazon S3 data lake and perform analytics using Amazon
Redshift, Amazon Redshift Spectrum, and Amazon EMR.
Result
Their monthly Amazon Redshift bill is now 20% of prior yearly
maintenance of their legacy data warehouse. AWS data lake and
analytics reduced report delivery time from months to days.
Amazon Redshift Amazon S3 Amazon EMR
58. Use case: Equinox (continued)
Clickstream
Cycling logs
Club management software
Applications
Social
Equinox applications
Third-party applications
Maximilian (ELT scripts)
Spark on Amazon EMR
• Migrated from Teradata data
warehouse
• Built a data warehouse with
Amazon Redshift and data lake with
Amazon S3
• Analytics on data lake with Amazon
Athena, Amazon Redshift Spectrum,
and Amazon EMR
• Increased user productivity to
move faster
• Amazon Redshift costs
approximately 20% of original
Teradata maintenance and support
• Report time reduced from months
to days
Amazon
Redshift
Amazon
Athena
Amazon EMR
Amazon
Redshift
Amazon S3
59. Solution 2: Data lakes
60. Journey to a modern data architecture
Evolution of data architecture
Types of data
Traditional data warehousing
Data lakes on AWS
Data warehouse modernization
Data governance
Machine learning
Real-time analytics with streaming data
61. Data lakes defined
• Stores all structured, semi-structured,
unstructured, and binary data at unlimited
scale
• Holds curated and raw data
• Uses AWS data analytics tools for analytics
• Increases pace of innovation by extracting
insights from data
• Enables more organizational agility
• Reduces cost and delivers results with
predictive analytics and ML
Architectural approach for a centralized
enterprise data repository stored on
Amazon S3
Machine
learning
Business
intelligence
and
analytics
Data
warehousing
Data lake
Open formats
central catalog
62. Secure data lake on Amazon S3
Amazon S3 Access Points
• Multi-tenant bucket
• Dedicated access points
• Customer permissions from an Amazon Virtual Private Cloud (Amazon VPC)
Amazon S3 Block Public Access
• Across AWS accounts and Amazon S3 bucket level
• Specify public permissions using Access Control List (ACL) or policy
• Four settings:
• BlockPublicAcls
• IgnorePublicAcls
• BlockPublicPolicy
• RestrictPublicBuckets
Amazon S3 object tags
• Access control, lifecycle policies, and analysis
• Classify data with metadata
• Use tags to filter objects
• Define replication policies
• Populate tags with AWS Lambda functions or S3 Batch Operations
Amazon S3 Object Lock
• Immutable Amazon S3 objects
• Retention management controls
• Data protection and compliance
Amazon FSx for Lustre
https://aws.amazon.com/compliance/services-in-scope
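The four Block Public Access settings named above map onto a single bucket-level configuration. The sketch below assumes the boto3 S3 client method `put_public_access_block`; the bucket name is hypothetical, and a small recording stand-in replaces the real client so the example runs without AWS credentials.

```python
# Sketch: enabling all four Block Public Access settings on one bucket.
# The bucket name is a placeholder; _RecordingClient is a local stand-in
# for boto3.client("s3") and is an assumption made for this example.

def public_access_block_config(block_all=True):
    # The four settings listed on the slide, all set to the same value.
    return {
        "BlockPublicAcls": block_all,
        "IgnorePublicAcls": block_all,
        "BlockPublicPolicy": block_all,
        "RestrictPublicBuckets": block_all,
    }

def apply_block_public_access(s3_client, bucket):
    # With a real client this would be: s3_client = boto3.client("s3")
    s3_client.put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration=public_access_block_config(),
    )

class _RecordingClient:
    """Records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_public_access_block(self, **kwargs):
        self.calls.append(kwargs)

client = _RecordingClient()
apply_block_public_access(client, "example-data-lake-bucket")
print(client.calls[0])
```

Passing the client in as a parameter keeps the function testable and makes it straightforward to swap in a real boto3 client later.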
63. Reference architecture: Data lake on AWS
Data ingestion: Amazon Kinesis, AWS Direct Connect, AWS Snowball, AWS DMS, AWS Data Exchange
Central storage: Amazon S3
Catalog and search: AWS Glue, Amazon ES, Amazon DynamoDB
Processing and analytics: Machine learning, Amazon QuickSight, Amazon EMR, Amazon Redshift, Amazon Athena
Access and user interface: Amazon API Gateway, IAM, Amazon Cognito
Protect and secure: IAM, Amazon CloudWatch, AWS STS, AWS CloudTrail, AWS KMS
64. Data services – AWS Glue
65. Cleansing data
After migration, data still presents challenges:
Data is increasingly diverse
• Volume
• Variety
• Velocity
• Veracity
It accumulates rapidly
• Missing or incorrect
data
• Wrong data format
• Partial missing data
Avoid unsearchable data
It must be cleansed before it is analyzed by many applications
How can customers provide access to users to gain insights?
66. AWS Glue
AWS Glue Data Catalog
• Hive metastore compatible with enhanced functionality
• Crawlers automatically extract metadata and create tables
• Integrates with Amazon Athena, Amazon EMR, and many more
Job authoring
• Generates ETL code
• Build on open frameworks – Python, Scala, and Apache Spark
• Developer-centric – editing, debugging, sharing
Job running
• Run jobs on a serverless Spark platform
• Use flexible scheduling, job monitoring, and alerting
Job workflow
• Orchestrate triggers, crawlers, and jobs
• Author and monitor entire flows and integrated alerting
67. AWS Glue crawlers
Amazon Redshift
Amazon DynamoDB
Amazon S3
Databases
AWS IAM role
AWS Glue crawler
JDBC
connection
NoSQL
connection
Object
connection
Built-in
classifiers
MySQL
MariaDB
PostgreSQL
Amazon Aurora
Oracle
Amazon Redshift
Apache Avro
Parquet
ORC
XML
JSON and JSONPaths
AWS CloudTrail
Binary JSON (BSON)
Logs
Delimited
… growing
68. AWS Glue Data Catalog services
AWS Glue Data Catalog
• Amazon Redshift lake house
• Amazon Athena
• AWS Glue ETL
• Amazon EMR
69. Use case: Log aggregation with ETL
AWS service logs
Web application logs
Server logs
Amazon S3
bucket
AWS Glue
crawler
Update table partition
Create partition
on Amazon S3
Query data
AWS Glue ETL
Amazon S3
bucket
AWS Glue Data
Catalog
Amazon
Athena
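The "create partition on Amazon S3" step in this flow usually comes down to writing each log file under a key prefix that encodes its partition values. The following sketch assumes a Hive-style `name=value` layout and a hypothetical `logs/` prefix; neither is specified in the slide.

```python
# Sketch: deriving a partitioned object key for an incoming log file.
# The "logs/" prefix and the source/year/month/day scheme are assumptions.
from datetime import datetime, timezone

def partition_prefix(source, timestamp):
    # One partition per log source and calendar day.
    return (
        f"logs/source={source}/"
        f"year={timestamp:%Y}/month={timestamp:%m}/day={timestamp:%d}/"
    )

def object_key(source, timestamp, filename):
    return partition_prefix(source, timestamp) + filename

example_key = object_key(
    "web",
    datetime(2020, 5, 29, 13, 0, tzinfo=timezone.utc),
    "access.log.gz",
)
print(example_key)
```

With a layout like this, a crawler can register each new prefix as a table partition, and queries that filter on the partition columns read only the matching prefixes.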
70. Data services – AWS Data Exchange and Amazon Athena
71. AWS Data Exchange
Find and subscribe to third-party data in the cloud
Find diverse data in one place
• More than 1,000 data products
• More than 80 data providers
Analyze data
• Download or copy data to Amazon S3
• Combine, analyze, and model with existing data
Access third-party data
• Streamlined access to data
• Minimize legal reviews and negotiations
72. Amazon Athena
Interactive query service to analyze data in Amazon S3 using
standard SQL
No setup costs
• Zero setup costs, point to Amazon S3 and start querying
Pay per query
• Pay only for queries run, save 30%–90% on per-query costs through compression
Open
• ANSI SQL interface, JDBC/ODBC drivers, multiple formats, compression types, and complex joins and data types
Streamlined
• Serverless, zero infrastructure, zero administration, integrated with Amazon QuickSight
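The link between compression and per-query cost follows from charging by the amount of data scanned. The arithmetic below is a sketch under stated assumptions: the price of 5.00 USD per terabyte scanned and the 4:1 compression ratio are placeholder values for illustration, so check current pricing before relying on the numbers.

```python
# Sketch: per-query cost as a function of bytes scanned.
# PRICE_PER_TB_USD and the 4:1 compression ratio are assumptions.

PRICE_PER_TB_USD = 5.00
BYTES_PER_TB = 1024 ** 4

def query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    return bytes_scanned / BYTES_PER_TB * price_per_tb

raw_bytes = 1 * BYTES_PER_TB          # query scans 1 TB of uncompressed data
compressed_bytes = raw_bytes // 4     # same data at an assumed 4:1 ratio

raw_cost = query_cost(raw_bytes)
compressed_cost = query_cost(compressed_bytes)
saving = 1 - compressed_cost / raw_cost

print(f"raw: ${raw_cost:.2f}, compressed: ${compressed_cost:.2f}, saving: {saving:.0%}")
```

Under these assumptions the saving is 75%, which sits inside the 30%–90% range quoted on the slide; a columnar format that lets a query skip unneeded columns would reduce the bytes scanned further.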
74. Challenges of building a secure data lake
Typical steps to build a secure data lake
1 Set up storage
2 Move data
3 Cleanse, prepare, and catalog data
4 Configure and enforce security and compliance policies
5 Make data available for analytics
Roles: Data engineer (ingestion and cleaning), Data security officer (security), Data analyst (analytics and machine learning)
75. AWS Lake Formation for a secure data lake
1 Ingest and organize
Automates creating data lake and data ingestion.
2 Secure and control
Sets up fine-grained access control and data governance.
3 Collaborate and use
Search and data discovery using Data Catalog metadata. To protect data, all access is checked against set policies.
4 Monitor and audit
Based on data access and governance policies, alert notifications are raised on policy violation and logged.
76. AWS Lake Formation benefits
Amazon
Redshift
Amazon
Athena
AWS Glue
Amazon EMR
Amazon
QuickSight
Amazon
SageMaker
AWS Lake Formation
Blueprints ML
Transforms
Data
Catalog
Access
control
Amazon S3
data lake storage
Cost effective, durable
storage includes global
replication capabilities.
Simplified ingest and cleaning
enables data engineers to
build faster.
Centralized management of
fine-grained permissions
empowers security officers.
Comprehensive set of
integrated tools enables every
user equally.
78. Amazon QuickSight
BI service built for the cloud with pay-per-session pricing and ML insights
Scalable
• Automatically scales with use and activity, with no additional infrastructure requirements. Seamlessly grows with customers.
Pay for use
• Pay monthly or annually. With pay-per-session pricing, customers only pay when they access their reports and dashboards, with no upfront costs.
Serverless and fully managed
• Fully managed cloud application, meaning there's no upfront cost, software to deploy, capacity planning, maintenance, upgrades, or migrations.
Fully integrated
• Deeply integrated with data sources and other AWS services like Amazon Redshift, Amazon S3, Athena, Amazon Aurora, Amazon RDS, IAM, AWS CloudTrail, and Amazon Cloud Directory, providing customers with everything they need for an end-to-end cloud BI solution.
79. Serverless data lakes and analytics
Amazon S3
AWS Glue
crawler
AWS Glue Data
Catalog
Amazon
Athena
Amazon EMR
Amazon
Redshift
Spectrum
Amazon
QuickSight
Amazon RDS
Web app data
Other databases
On-premises data
Streaming data
80. Use case: COVID-19 pandemic
Challenge
The COVID-19 pandemic has
stressed healthcare systems,
businesses, and economies. It
has disrupted the daily lives of
people around the world.
People need a solution to
capture data (diagnosis,
mortality, and recovery rates)
globally in real time, and turn
the data into insights they can
share and respond to with
confidence.
Solution
Amazon worked with APN
Partners Salesforce, Tableau,
and MuleSoft to create a
secure data lake using AWS
Data Exchange, AWS Glue,
Amazon Athena, and Amazon
S3 as a store of trusted data
from open source COVID-19
data providers.
Benefits
Health workers, scientists, and
decision makers can access
and compare international
data to their local data,
enabling understanding and
visualization of the impact of
COVID-19 locally and globally.
This solution enables decision
making and deeper insights to
help manage and flatten the
COVID-19 curve until a
vaccine is available.
81. Use case: COVID-19 data lake architecture
https://d2908q01vomqb2.cloudfront.net/77de68daecd823babbb58
edb1c8e14d7106e83bb/2020/05/29/COVID-19-AWS-Tableau-
Tableau: COVID-19 data platform Visualization for
desktop for users
Upload to Amazon S3
Amazon S3
Amazon S3 Amazon
Athena
AWS Glue
Lambda function Data revision
export to Amazon S3
Define
Athena Schema
AWS Cloud
AWS Data Exchange
Publish and update data products with
AWS Data Exchange
Connect to S3 data with
Amazon Athena
connector in Tableau
82. Summary
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Amazon
Redshift
• Amazon S3
• AWS Glue
• AWS Data Exchange
• Amazon Athena
• AWS Lake
Formation
• Amazon QuickSight
AWS data
migration
options
84. Activity overview
The activity consists of a video demonstration of three key steps:
• Step 1: Build a serverless data lake
• Build a data lake with an AWS CloudFormation template
• Load raw New York City (NYC) taxi data into Amazon S3 bucket
• Program an AWS Glue ETL job to convert raw taxi data into Parquet data storage
format
• Step 2: Run Amazon Athena query
• Run a SQL query with Amazon Athena to query taxi data in Parquet format
• Step 3: Visualize data with Amazon QuickSight
• Use Amazon Athena to visualize data with Amazon QuickSight
https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
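Step 2 of the activity can be sketched in Python. This is a minimal sketch, not the code used in the demonstration: the table name `nyc_taxi_parquet`, the column `vendor_id`, and the `FakeAthenaClient` class are assumptions made for illustration. The `start_query_execution` call and its three parameters follow the Athena client in the AWS SDK for Python (boto3).

```python
from typing import Any, Dict


def start_taxi_query(athena_client: Any, database: str, output_location: str) -> str:
    """Submit a SQL query to Athena and return the query execution ID."""
    # Hypothetical table and column names; use the ones the AWS Glue crawler registers.
    sql = (
        "SELECT vendor_id, COUNT(*) AS trips "
        "FROM nyc_taxi_parquet "
        "GROUP BY vendor_id "
        "ORDER BY trips DESC"
    )
    response: Dict[str, Any] = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


class FakeAthenaClient:
    """Hypothetical stand-in for local dry runs; records the request instead of calling AWS."""

    def __init__(self) -> None:
        self.last_request: Dict[str, Any] = {}

    def start_query_execution(self, **kwargs: Any) -> Dict[str, Any]:
        self.last_request = kwargs
        return {"QueryExecutionId": "dry-run-0001"}
```

Against a real account, the same function would be called with `boto3.client("athena")` in place of the fake client, and the results would be written to the Amazon S3 location passed as `output_location`.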
85. Step 1: Serverless data lake architecture
AWS Glue
Crawler
AWS
Lambda
Amazon S3 Amazon
CloudWatch
Amazon SQS
Amazon SNS
AWS
Lambda
Amazon S3
Amazon
CloudWatch
AWS Glue
Raw zone Processed zone
Email notification
ETL job
87. Objectives
In this module, you will learn about three key types of data
analytics technical solutions on AWS:
• Streaming and real-time analytics with Amazon Kinesis
• Data governance
• Extended solution: Insights and monetization with machine learning (ML)
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
88. Solution 3: Streaming and
real-time analytics with
Amazon Kinesis
89. Journey to a modern data
architecture
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Types of data used
90. Streaming data defined
Data that is generated continuously from thousands
of data sources, sent simultaneously
Player-game
interactions Geolocation of
cars and devices
Music
downloads
Website clicks
Social media
streams
91. Common use cases: Real-time
analytics
Milliseconds Seconds Minutes Hours
• Messaging between
microservices
• Response analytics
(web and mobile
application
notifications)
• Log ingestion
• Internet of Things (IoT)
device maintenance
• Change data capture
(CDC)
• Streaming ETL
into data lakes
and data
warehouse
The value of data diminishes over time
92. Enabling real-time analytics
Data streaming technology enables a customer to ingest, process, and
analyze high volumes of high-velocity data from a variety of sources, in real
time.
93. Data streaming solution challenges
Challenges of building on-premises, real-time streaming solutions:
• Difficult to set up
• Difficult to achieve high availability
• Error prone and complex to manage
• Tricky to scale
• Integration requires development
• Expensive to maintain
94. AWS streaming data solutions
Efficiently collect, process, and analyze data streams in real
time
• Amazon Kinesis Data Streams
• Amazon Kinesis Data Firehose
• Amazon Kinesis Data Analytics
95. Data generators: Simple streaming
data patterns
Data producers
• Mobile and applications
• Amazon Kinesis Agent
• Amazon Kinesis Data Streams
• Amazon CloudWatch Logs
• Amazon CloudWatch Events
• AWS IoT
• Apache Kafka
• Amazon Kinesis Producer Library (KPL)
Streaming services
• Amazon Kinesis Data Firehose
• Amazon Kinesis Data Analytics
• Amazon Kinesis Data Streams
Data consumers
• Amazon EMR
• Amazon Redshift
• Amazon Simple Storage Service (S3)
• Amazon EC2
• Amazon Kinesis Connector Library
96. Amazon Kinesis Data Streams
97. Amazon Kinesis Data Streams
Massively scalable, highly durable data ingestion and processing
service optimized for real-time data streaming
• No upfront cost; low, pay-as-you-go pricing
• Data collected is available within 70 milliseconds
• Real-time analytics: dashboards, anomaly detection, dynamic pricing
• Data is synchronously replicated across 3 Availability Zones in a Region
• Data can be stored for up to 7 days
• Serverless; can scale dynamically to handle MB to TB each hour and
thousands to millions of PutRecords each second
https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=5
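A producer that sends many events usually groups them into PutRecords calls. The sketch below shows only the batching step and does not call AWS. The 500-record batch size is an assumption about the per-request limit and should be checked against the current service quotas. The field name `player_id` in the usage note is hypothetical.

```python
import json
from typing import Dict, Iterable, Iterator, List

# Assumed per-request record limit for PutRecords; verify against current quotas.
MAX_RECORDS_PER_CALL = 500


def to_put_records_entries(events: Iterable[dict], key_field: str) -> List[Dict[str, object]]:
    """Convert events into entries shaped like the PutRecords 'Records' parameter."""
    return [
        {"Data": json.dumps(event).encode("utf-8"), "PartitionKey": str(event[key_field])}
        for event in events
    ]


def batches(
    entries: List[Dict[str, object]], size: int = MAX_RECORDS_PER_CALL
) -> Iterator[List[Dict[str, object]]]:
    """Yield consecutive slices of at most `size` entries."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]
```

With boto3, each batch could then be passed as `Records=batch` to `put_records` on a Kinesis client, together with a `StreamName`. For example, `to_put_records_entries(events, "player_id")` would use each event's player ID as its partition key.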
98. How Kinesis Data Streams works
Amazon Kinesis
Data Analytics
Amazon EC2
AWS Lambda
Input
Output
Spark on Amazon EMR
Amazon
Kinesis Data
Streams
Capture and send data Ingest and store data
streams for processing
Build custom, real-time
applications
Analyze streaming data
using BI tools
99. Kinesis Data Streams architecture
Amazon EC2
instances
Client
Mobile client
Traditional
server
Data
producers
Shard
1
Shard
2
Shard
N
Amazon
Kinesis Data
Stream
EC2
instance
EC2
instance
Data
consumers
Amazon Redshift
Amazon S3
Amazon
Kinesis Data
Firehose
Amazon EMR
Amazon DynamoDB
Shard 1
Data
record
• Sequence #
• Partition Key
• Data blob
Data stream
https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=5
Amazon
Kinesis Data
Firehose
Amazon
Kinesis Data
Analytics
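The role of the partition key in a data record can be illustrated with a small simulation. Kinesis Data Streams hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash key range contains it. The sketch below assumes shards that split the hash space evenly, which is a simplification: a real stream's ranges depend on how its shards were created, split, or merged.

```python
import hashlib
from typing import List, Tuple

HASH_SPACE = 2 ** 128  # size of the 128-bit hash key space


def even_shard_ranges(shard_count: int) -> List[Tuple[int, int]]:
    """Split the hash key space into contiguous, roughly equal ranges (an assumption)."""
    width = HASH_SPACE // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * width
        # The last range absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * width - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key: str, ranges: List[Tuple[int, int]]) -> int:
    """Return the index of the range that contains the MD5 hash of the partition key."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    for index, (start, end) in enumerate(ranges):
        if start <= hash_key <= end:
            return index
    raise ValueError("hash key outside all ranges")
```

Because the mapping is deterministic, all records that share a partition key land in the same shard, which is what preserves their relative order.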
100. Kinesis Data Streams provisioning
101. Amazon Kinesis Data Firehose
102. How Kinesis Data Firehose works
Amazon
Kinesis Data
Firehose
Input
Output
Splunk
Amazon Redshift
Amazon S3
Amazon
Elasticsearch Service
Capture and send data Prepares and loads data
continuously to the
selected destinations
Durably store the data
for analytics
Analyze streaming data
using analytics tools
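The "prepares and loads data continuously" step relies on buffering: records accumulate until a size or time threshold is reached, and then the batch is delivered. The class below is a local simulation of that idea, not the service's implementation. `DeliveryBuffer` and its fields are hypothetical names, and this sketch checks the time threshold only when a new record arrives.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DeliveryBuffer:
    max_bytes: int      # size threshold in bytes
    max_seconds: float  # time threshold in seconds
    _records: List[bytes] = field(default_factory=list, init=False)
    _bytes: int = field(default=0, init=False)
    _opened_at: float = field(default=0.0, init=False)

    def add(self, record: bytes, now: float) -> List[bytes]:
        """Buffer a record; return the batch to deliver if a threshold is reached, else []."""
        if not self._records:
            self._opened_at = now  # the buffer window starts with its first record
        self._records.append(record)
        self._bytes += len(record)
        if self._bytes >= self.max_bytes or now - self._opened_at >= self.max_seconds:
            return self.flush()
        return []

    def flush(self) -> List[bytes]:
        """Return all buffered records and reset the buffer."""
        batch, self._records, self._bytes = self._records, [], 0
        return batch
```

A small buffer such as `DeliveryBuffer(max_bytes=10, max_seconds=60)` makes the two triggers easy to observe: the batch is released either when 10 bytes have accumulated or when 60 seconds have passed since the first buffered record.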
103. Kinesis Data Streams and
Kinesis Data Firehose
Characteristics: Amazon Kinesis Data Streams compared with Amazon Kinesis Data Firehose
• Processing time
  Kinesis Data Streams: As fast as 70 milliseconds after ingestion
  Kinesis Data Firehose: Between 60 and 900 seconds
• Stream storage and duration
  Kinesis Data Streams: In shards; default 24 hours and up to 7 days
  Kinesis Data Firehose: Max buffer size 128 MB and max buffer time 900 seconds
• Data transformation and conversion
  Kinesis Data Streams: None
  Kinesis Data Firehose: Uses AWS Lambda and AWS Glue
• Data producer (both services)
  Amazon Kinesis Agent, applications using Amazon Kinesis Producer Library (KPL),
  AWS SDK for Java, Amazon CloudWatch Logs and CloudWatch Events, AWS IoT
• Data consumer
  Kinesis Data Streams: AWS Lambda, Amazon Kinesis Data Analytics, Amazon Kinesis Data
  Firehose, applications using the Kinesis Client Library (KCL) and SDK for Java
  Kinesis Data Firehose: AWS Lambda, Amazon Kinesis Data Analytics, and Kinesis Data
  Firehose, apps using the KCL and SDK for Java, Amazon S3, Amazon Redshift, Amazon ES,
  Splunk, and Amazon Kinesis Data Analytics
• Data compression
  Kinesis Data Streams: None
  Kinesis Data Firehose: gzip, Snappy, Zip, or no data compression
https://aws.amazon.com/kinesis/data-streams/faqs/?nc=sn&loc=5
https://aws.amazon.com/kinesis/data-firehose/faqs/?nc=sn&loc=5
104. When to use Kinesis Data Streams
and Kinesis Data Firehose
Amazon Kinesis
Data Streams
For data streaming applications with massive ingestion requirements
• Requires data to be sent to consumer analytics services for millisecond
response time
• Massively scalable
• Data retention time ranging from hours to days
• Example: Real-time gaming
Amazon Kinesis
Data Firehose
For data streaming applications that require near real-time responses in
seconds
• Need for data augmentation, data transformation, or data compression
• Need to save data to Amazon S3, Amazon Redshift, Amazon ES, Splunk,
or send data to Amazon Kinesis Data Analytics for analytics
• Example: Log analytics
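The selection guidance can be condensed into a small helper for discussion purposes. This is a simplified sketch with two inputs only. The function name, its parameters, and the 60-second cutoff (taken from the lower bound of the Firehose processing time in the comparison table) are assumptions, and a real decision involves more criteria.

```python
def choose_kinesis_service(max_latency_seconds: float, needs_managed_delivery: bool) -> str:
    """Suggest a service from a latency budget and a managed-delivery requirement."""
    # Firehose delivery is buffered (60 to 900 seconds per the comparison table),
    # so a tighter latency budget points to Kinesis Data Streams.
    if max_latency_seconds < 60:
        return "Amazon Kinesis Data Streams"
    # Managed delivery covers transformation, compression, and loading into
    # destinations such as Amazon S3, Amazon Redshift, Amazon ES, or Splunk.
    if needs_managed_delivery:
        return "Amazon Kinesis Data Firehose"
    # Otherwise assume custom consumers read directly from the stream.
    return "Amazon Kinesis Data Streams"
```

For a real-time gaming workload with a sub-second budget the helper suggests Kinesis Data Streams. For log analytics that tolerates minutes and needs data loaded into Amazon S3 it suggests Kinesis Data Firehose.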
105. Amazon Kinesis Data Analytics
106. Amazon Kinesis Data Analytics
Input
Amazon Kinesis
Data Analytics Output
Capture streaming data
with Amazon MSK,
Amazon Kinesis Data
Streams, Amazon Kinesis
Data Firehose, or other
data sources
Query and analyze
streaming data
Send processed data
to analytics tools to
create alerts and
respond in real time
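The "query and analyze" step often means aggregating events over time windows. The sketch below implements a tumbling (fixed, non-overlapping) window count in plain Python as a stand-in for the query a Kinesis Data Analytics application would run. The function names and the event shape `(timestamp, key)` are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def tumbling_window_counts(
    events: Iterable[Tuple[float, str]], window_seconds: int
) -> Dict[int, Counter]:
    """Count events per key inside fixed, non-overlapping time windows."""
    windows: Dict[int, Counter] = defaultdict(Counter)
    for timestamp, key in events:
        # Each event belongs to exactly one window, identified by its start time.
        window_start = int(timestamp // window_seconds) * window_seconds
        windows[window_start][key] += 1
    return dict(windows)


def alerts(windows: Dict[int, Counter], threshold: int) -> List[Tuple[int, str, int]]:
    """Return (window_start, key, count) for every count at or above the threshold."""
    return [
        (start, key, count)
        for start, counts in sorted(windows.items())
        for key, count in sorted(counts.items())
        if count >= threshold
    ]
```

The output of `alerts` corresponds to the last step on the slide: processed results are handed to a downstream tool that raises an alert or triggers a response.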
107. Kinesis data analytics application
details
108. Use case: Clickstream analytics
Amazon
Kinesis Data
Firehose
Input Output
Amazon
Kinesis Data
Firehose
Amazon
Kinesis Data
Analytics
Amazon Redshift
Evolve from batch processing to real-time analytics
Websites send
clickstream data
Collects the data
and sends to Kinesis
Data Analytics
Processes data in
near-real time
Loads
processed data
into Amazon
Redshift
Runs analytics
models to
identify content
recommendations
Readers see
personalized
content
suggestions and
increase
engagement
109. Put it all together:
Streaming data analytics with
AWS
110. Streaming data analytics
architecture
Amazon
Redshift
Amazon
RDS
DynamoDB
Kinesis
Data Streams
Kinesis
Data Firehose
Kinesis
Data Analytics
Amazon
Elasticsearch
Service
Amazon S3
data lake
AWS Lambda
Amazon Simple
Notification Service
Amazon
Kinesis
enabled
applications
Millions of
data sources
Machine
learning
Kinesis
Data Streams
Kinesis
Data Firehose
Data science
Reporting
Logs and
processed data
Downstream
applications
Alerts Notifications
1
2
3
4
5
Fan-out
111. Solution 4: Data governance
112. Journey to a modern data
architecture
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Types of data used
113. Challenges of data in data lakes
• Securing data
• Auditing data usage
• Managing data access
• Safeguarding sensitive data and PII
• Maintaining regulations and
mandates
114. Data security and governance
© ENTERPRISE STRATEGY GROUP, 2019.
With big data comes big
responsibility.
More than one in three companies cite data privacy and
governance as a hurdle to both digital transformation and IoT
initiatives
34% of IT decision makers cite ensuring data governance/privacy as one of
their organization’s biggest digital transformation challenges
37% of IT decision makers cite ensuring security/compliance upon movement
of data as one of their most important IoT priorities over the next
18–24 months
https://www.esg-global.com/hubfs/ESG-Infographic-IT-Spending-Intentions-
115. Resolving PII dangers
Personally
identifiable
information
(PII)
Consumer
consent
violation
Data
breach
Spyware
Unsecured
devices
Rogue
agents
Second-
party
misuse
Espionage
External
hacking
• Do these issues need to be
resolved?
• Is there a solution
architecture that solves all
PII issues?
• What best practices can be
used to mitigate PII
dangers?
116. Amazon Macie
Amazon Macie
Continually evaluate
Amazon S3
environment
Discover
sensitive data
Take action
Enable Amazon
Macie with one-
click in the AWS
Management
Console or with a
single API call
Automatically
generates an
inventory of
Amazon S3 bucket
and details on the
bucket-level
security and access
controls
Analyzes bucket using
ML and pattern
matching to discover
sensitive data, like PII
Generates findings
and sends to
Amazon
CloudWatch
Events for
integration into
workflows and
remediation
actions
• Financial
• Personal
• National
• Medical
• Credentials and
secrets
117. De-identified data lake (DIDL) on AWS
A de-identified data lake (DIDL) is an architectural approach that reduces the
risks associated with managing data, particularly personally identifiable
information (PII).
Benefits
Reduce risk
• Remove PII before it enters a data lake
Understand all the data
• Create a Data Catalog of an entire data lake
Reduce compliance costs
• Automate the discovery, classification, de-identification,
and ongoing monitoring of data across an organization
Turn data into an asset, not a liability
• Enable a broader set of governed analytic and machine learning use cases
118. Masking PII data
Original data
Email                 Customer ID  Transcript
csalazar@example.com  19664        Just talked to Carlos Salazar
mary@example.com      23423        Mary’s SSN is 000000000
mateo@example.com     99644        Mateo is moving to Nevada
NA                    02945        It is expected to rain tomorrow
Masked data
Email                 Customer ID  Transcript
4t34gttt              7462391      Just talked to Jane Roe
44e5325               1239474      Jorge’s SSN is 666666666
0we&yrw               9983487      Sofia is moving to Texas
NA                    3344325      It is expected to rain tomorrow
Masked fields by column: Email (Email), ID (Customer ID), Name, SSN, and State (Transcript)
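One possible way to mask the structured columns is keyed pseudonymization, shown below as a minimal sketch. It is not the method behind the slide: the HMAC-based token, the SSN regular expression, and the record field names (`email`, `customer_id`, `transcript`) are assumptions. Replacing names and states inside free text, as the slide does, would need entity detection, which is out of scope here.

```python
import hashlib
import hmac
import re

# Nine digits, optionally grouped 3-2-4 with hyphens (an assumed SSN pattern).
SSN_PATTERN = re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b")


def pseudonymize(value: str, secret: bytes, length: int = 12) -> str:
    """Derive a stable token from a value with a keyed hash (HMAC-SHA256)."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()[:length]


def mask_record(record: dict, secret: bytes) -> dict:
    """Return a copy of the record with identifiers tokenized and SSNs redacted."""
    masked = dict(record)
    if record.get("email") not in (None, "NA"):
        masked["email"] = pseudonymize(record["email"], secret)
    masked["customer_id"] = pseudonymize(str(record["customer_id"]), secret)
    masked["transcript"] = SSN_PATTERN.sub("[REDACTED-SSN]", record["transcript"])
    return masked
```

Because the token depends only on the value and the secret, the same email maps to the same token in every record, so masked datasets can still be joined on the masked column.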
119. Extended solution 5: Insights
and monetization with ML on
AWS
120. Journey to a modern data
architecture
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
Types of data used
121. Data lakes and machine learning
Machine learning requires:
• More data: Collect all types of data
• Flexibility: Define schema during analysis
• Scalability: Scale storage and compute (CPU
or GPU) independently
• Data transformation and processing: Run a
broad set of processing and analytics on the
same data without movement
• Security: Networking, identity, encryption, and
compliance
OLTP ERP CRM LOB
Data warehouse
Business
analytics
Data lake
Devices
Web
Sensors
Social
Data Catalog
AI and
machine learning
Data warehouse
queries
Big data
processing
Interactive Real time
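"Define schema during analysis" is often called schema-on-read: raw records are stored unchanged, and each analysis applies the fields and types it needs when it reads them. The sketch below illustrates the idea on JSON lines. The function name and the sample field names in the usage note are hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, Iterator

# A schema maps a field name to the function that casts its raw value.
Schema = Dict[str, Callable[[object], object]]


def read_with_schema(raw_lines: Iterable[str], schema: Schema) -> Iterator[dict]:
    """Parse raw JSON lines and keep only the schema's fields, cast at read time."""
    for line in raw_lines:
        raw = json.loads(line)
        yield {name: cast(raw[name]) for name, cast in schema.items() if name in raw}
```

Two analyses can read the same stored lines with different schemas, for example `{"device": str, "temp": float}` for one and `{"ts": int}` for another, without rewriting the raw data.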
122. Amazon SageMaker
Machine learning at enterprise scale
Build
Train and tune
Deploy and manage
Notebooks for
common
problems
High-
performance
algorithms
• Managed Jupyter for enterprise data science
• Sample notebooks for most common use
cases
• Single-pass, streaming training algorithms
One-click
training
Hyperparameter
optimization
One-click
deployment
Fully managed
elastic hosting
• Training models at scale without DevOps
assistance
• ML on ML to optimize hyperparameters
• Deploy to production with a single call
• Fully managed, production-grade inferences
https://aws.amazon.com/machine-learning/?nc2=h_ql_prod_ml
123. Machine learning resources
• Fundamental digital course
on how SageMaker
mitigates the core
challenges of implementing
an ML pipeline
• Duration: 30 minutes
• https://www.aws.training/Details/Video?id=49646
• Explore how to use the
machine learning pipeline to
solve a real business
problem (intermediate)
• Duration: 4 days
• https://www.aws.training/SessionSearch?pageNumber=1&courseId=38910
• Learn to solve real-world use
cases with machine learning
(intermediate)
• Duration: 1 day
• https://www.aws.training/SessionSearch?pageNumber=1&courseId=40748
AWS Foundations: How
Amazon SageMaker Can Help
Practical Data Science with
Amazon SageMaker
The Machine Learning Pipeline
on AWS
https://partnercentral.awspartner.com/LmsSsoRedirect?RelayState=%2flearningobject%2fcurriculum%3fid%3d25521
AWS STP: Machine Learning (ML) on AWS for ML Practitioners - Technical
124. Summary
Evolution of data architecture
Traditional
data warehousing
Data lakes
on AWS
Real-time
analytics with
streaming data
Data
warehouse
modernization
Data
governance
Machine
learning
• Kinesis Data
Streams
• Kinesis Data
Firehose
• Kinesis Data
Analytics
Amazon Macie Amazon
SageMaker
127. The Data Flywheel
3. Build
data-driven applications
4. Analyze with
data lake
architectures
1. Move and store
data in the cloud
2. Move and manage all
workloads in the cloud
5. Innovate with
machine learning
128. Conversations using the Data
Flywheel
3. Build
data-driven apps
4. Analyze with
data lake
architectures
5. Innovate with
machine learning
1. Move and store
data in the cloud
2. Move and manage all
workloads in the cloud
129. AWS six-phase strategy
for implementing a data
analytics solution
130. Data analytics projects: A phased strategy
Phase 1: Data analytics in the cloud assessment
Phase 2: Use case identification
Phase 3: Architecture and data migration
Phase 4: POC delivery
Phase 5: Application tuning and optimization
Phase 6: Migration from POC to production
131. Phase 1: Data analytics in the cloud
assessment
Phase 1
Use case
identification
Phase 2
Architecture
and data
migration
Phase 3
POC
Delivery
Phase 4
Application
tuning and
optimization
Phase 5
Migration
from POC to
production
Phase 6
Data analytics
in the cloud
Assessment
132. Phase 2: Use case identification
Data
analytics in
the cloud
assessment
Phase 1
Architecture
and data
migration
Phase 3
POC
delivery
Phase 4
Application
tuning and
optimization
Phase 5
Migration
from POC to
production
Phase 6
Use case
identification
Phase 2
Use case
identification
133. Phase 3: Architecture and data
migration
Data
analytics in
the cloud
Assessment
Phase 1
Use case
identification
Phase 2
POC
delivery
Phase 4
Application
tuning and
optimization
Phase 5
Migration
from POC to
production
Phase 6
Architecture
and data
migration
Phase 3
134. Architecture and data migration: APN Partner
best practices
Architecture
and data
migration
Phase 3
Avoid
• Engaging AWS Support too late in the process
Do
• Engage AWS
• AWS Partner Development Managers
• Partner Solutions Architects
• AWS Professional Services
135. Phase 4: Proof of concept delivery
Data
Analytics in
the cloud
assessment
Phase 1
Use case
identification
Phase 2
Architecture
and data
migration
Phase 3
Application
tuning and
optimization
Phase 5
Migration
from POC to
production
Phase 6
POC delivery
Phase 4
136. Phase 5: Application tuning
and optimization
Data
analytics in
the cloud
assessment
Phase 1
Use case
identification
Phase 2
Architecture
and data
migration
Phase 3
POC
delivery
Phase 4
Migration
from POC to
production
Phase 6
Application
tuning and
optimization
Phase 5
Application
tuning and
optimization
137. Phase 6: Migration from POC
to production
Data
analytics in
the cloud
assessment
Phase 1
Use case
identification
Phase 2
Architecture
and data
migration
Phase 3
POC
delivery
Phase 4
Application
tuning and
optimization
Phase 5
Migration
from POC to
production
Phase 6
Migration
from POC to
production
138. Phase 6: POC to production best practices
POC to
production
Phase 6
Do
• Identify groups and roles in the
organization that requested the POC
• Create a thought-out plan
• Set up a continuous integration and
continuous delivery (CI/CD) pipeline
• Set up metrics and alarms for
production environment
• Continue engagement with the
customer
140. 10 design principles:
Analytics applications, 1–5
1. Automate data ingestion to handle big data
2. Design ingestion for failures and duplicates
3. Preserve original source data
4. Describe data with metadata
5. Establish data lineage
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Analytics-Lens.pdf
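Principle 2, designing ingestion for failures and duplicates, can be illustrated with an idempotent write loop. This is a minimal sketch under stated assumptions: `record_id`, `ingest`, the in-memory `store`, and the `write` callback are hypothetical names, the content hash stands in for whatever unique key the source data provides, and `IOError` stands in for a transient failure.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable, List


def record_id(record: dict) -> str:
    """Derive a stable ID from the record content so redelivered copies match."""
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def ingest(
    records: Iterable[dict],
    store: Dict[str, dict],
    write: Callable[[dict], None],
    max_attempts: int = 3,
) -> List[dict]:
    """Write each new record once, retrying failures; return records that never succeeded."""
    failed: List[dict] = []
    for record in records:
        key = record_id(record)
        if key in store:
            continue  # duplicate delivery: already ingested, skip it
        for _attempt in range(max_attempts):
            try:
                write(record)
                store[key] = record
                break
            except IOError:
                continue  # treat as transient and retry
        else:
            failed.append(record)  # keep the original for later replay
    return failed
```

Returning the failed records instead of dropping them also supports principle 3, because the original source data stays available for replay.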
141. 10 design principles:
Analytics applications, 6–10
6. Use the right ETL tool for the job
7. Orchestrate ETL workflows
8. Tier storage appropriately
9. Secure, protect, and manage the entire analytics pipeline
10. Design for scalable and reliable analytics pipelines
https://d1.awsstatic.com/whitepapers/architecture/wellarchitected-Analytics-Lens.pdf
143. Objectives
In this module, you will learn how to:
• Describe how to collaborate with AWS for data analytics
• Describe AWS Data and Analytics resources for APN Partners:
• Competency categories
• AWS Immersion Days
• AWS Certified Data Analytics and learning resources
• Access the AWS Marketplace
• Perform the calls to action
144. APN Partners and
AWS for Data Analytics
145. Discounting and funding programs
Migration
programs
POC funding
146. AWS Data and Analytics
Competency categories
• Data Analytics Platforms: Provide a set of integrated tools to solve data
analytics challenges within a standard framework
• NoSQL/New SQL: Provide highly scalable databases that organize data into a
structure
• Data Integration and Preparation: Enable customers to move and consolidate
data from disparate sources, transform it, and prepare it for analytics
• Business Intelligence (BI) and Data Visualization: Help customers turn raw
data into actionable business information, such as reporting, dashboards,
and data visualization
• Data Governance and Security: Help customers discover, categorize, and
control their data
147. Best practices after identifying an
opportunity
Use existing Partner
programs
Cultivate strong
relationships with
AWS sales teams
Register your
opportunity
through
APN Partner Central
Achieve AWS Data and
Analytics competency
148. Collaboration workflow
Build a
reference
solution
Conduct a big
data POC
Validate the
POC
Build and
deliver the live
solution
Receive
approval from
AWS PSM
Engage
AWS sales
Engage AWS
account or
Partner SA
Register an
opportunity on
APN Partner
Central
Before SA
involvement
Direct SA
involvement
149. AWS Professional Services
• Global team of experts
• Collaborate with APN Partners to help customers realize their
desired business outcomes in AWS Cloud
• Reach out to APN Partners when they need additional resources
AWS Professional Services: https://aws.amazon.com/professional-services/
150. AWS data analytics solutions
and Immersion Days
151. AWS Data Lab program
• The AWS Data Lab program offers accelerated joint engineering
engagements between a team of customer builders and AWS
technical resources to create tangible deliverables that
accelerate data and analytics modernization initiatives.
• Two offerings:
Design Lab
Focus on real-world
architectural design
Build Lab
Focus on providing
guidance with
building a
functioning
prototype with a
customer team
Duration
Half day to 5 days
Location
Virtual or AWS Data Lab hub – Seattle,
NYC, Herndon (VA), London, Bangalore
Cost
Free. Reach out to your APN support
team for more information.
https://aws.amazon.com/aws-data-lab/
152. AWS Immersion Days
Designed to help APN Advanced and Premier Consulting Partners deliver technical data
analytics workshops to their customers and help grow their businesses
Data Engineering
Immersion Day
Build a serverless data lake
solution on AWS including
modules focusing on
ingestion, hydration,
exploration, and
consumption
https://aws.amazon.com/partners/immersion-days/
Amazon EMR
Immersion Day
Focus on unique facets of
Amazon EMR for big data
workloads
Database Migration
Immersion Day
Give your customers a head
start with the AWS Database
Migration Service and the
Schema Conversion Tool
… and many more.
Benefits: Access to technical workshop content, AWS usage credits, Market Development
Funds (MDF) opportunities, and support from AWS teams
153. AWS Certified data analytics
and learning resources
155. AWS Certified Data Analytics –
Specialty
https://aws.amazon.com/certification/certified-data-analytics-specialty/
156. Partner Cast: Analytics
157. Call to action
158. Build a data analytics practice on AWS
Build packaged
solutions
Know your
Partner Solutions
Architect
Ask for customer
references
Engage with AWS
service teams
Develop
customer
workshops
Achieve an APN
competency
159. Call to action
Use the Data
Flywheel to perform
assessments
Work with your
Partner team to
schedule an
Immersion Day for
your customers
View the analytics
customer case
studies
https://aws.amazon.com/big-data/datalakes-and-analytics/
Create a specialized
service around one
of the analytics
services
Participate in the
AWS Data Lab
https://aws.amazon.com/aws-data-lab/
Prepare for the AWS
Data Analytics –
Specialty
certification
Build relationships
with APN teams for
funding
opportunities for
your marketing and
sales efforts
160. © 2020 Amazon Web Services, Inc. or its affiliates. All rights reserved. This work may not be reproduced or redistributed, in whole or in part, without prior
written permission from Amazon Web Services, Inc. Commercial copying, lending, or selling is prohibited. For corrections or feedback on the course, please email
us at: aws-course-feedback@amazon.com. For all other questions, contact us at: https://aws.amazon.com/contact-us/aws-training/. All trademarks are the
property of their owners.
Thank You!
Parvesh Chopra : choprapa@amazon.com