Data Science on AWS - Collision Conference - June 2020
1. © 2020, Amazon Web Services, Inc. or its Affiliates.
Chris Fregly
Developer Advocate AI/ML
Amazon Web Services
Data Science on AWS
2.
Who Am I?
Based in San Francisco
Meetup Organizer, Advanced SageMaker and Kubeflow Meetup:
https://meetup.com/Advanced-Kubeflow/ (12,000+ Members)
O’Reilly Author, Data Science on AWS: https://datascienceonaws.com
Former Engineer at Netflix (Video Streaming) and Databricks (Spark Streaming)
3.
Agenda
Why Choose AWS for Data Science?
Amazon Managed Services for Data Science
DEMOs!
Analyze Amazon Customer Reviews Dataset
Ingest S3 Data with Amazon Athena and Redshift
Data Analysis with Pandas, Matplotlib, and Amazon SageMaker Notebooks
Data Quality Checks with Apache Spark and Amazon SageMaker Processing Jobs
4.
Why Choose AWS for Data Science?
5.
Why Choose AWS for Data Science?
Most secure infrastructure and certifications
Most scalable and cost-effective options
Easiest to build data science solutions
Most comprehensive and open
6.
Easiest to Build Data Science Solutions
Build secure data lakes in days (vs. months or years)
A single storage layer (S3) for all analytics and ML
Deep integration across all AWS analytics and machine learning services, including federated queries across different services
The fastest way to go from zero to business insights, covering all data for all users
7.
Most Secure Infrastructure
Customers need multiple levels of security, identity and access management, encryption, and compliance to secure their data lake.
Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC
Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
8.
Most Certifications
Global
CSA: Cloud Security Alliance Controls
ISO 9001: Global Quality Standard
ISO 27001: Security Management Controls
ISO 27017: Cloud Specific Controls
ISO 27018: Personal Data Protection
PCI DSS Level 1: Payment Card Standards
SOC 1: Audit Controls Report
SOC 2: Security, Availability, & Confidentiality Report
SOC 3: General Controls Report
United States
CJIS: Criminal Justice Information Services
DoD SRG: DoD Data Processing
FedRAMP: Government Data Standards
FERPA: Educational Privacy Act
FIPS: Government Security Standards
FISMA: Federal Information Security Management
GxP: Quality Guidelines and Regulations
FFIEC: Financial Institutions Regulation
HIPAA: Protected Health Information
ITAR: International Arms Regulations
MPAA: Protected Media Content
NIST: National Institute of Standards and Technology
SEC Rule 17a-4(f): Financial Data Standards
VPAT/Section 508: Accountability Standards
Asia Pacific
FISC [Japan]: Financial Industry Information Systems
IRAP [Australia]: Australian Security Standards
K-ISMS [Korea]: Korean Information Security
MTCS Tier 3 [Singapore]: Multi-Tier Cloud Security Standard
My Number Act [Japan]: Personal Information Protection
Europe
C5 [Germany]: Operational Security Attestation
Cyber Essentials Plus [UK]: Cyber Threat Protection
G-Cloud [UK]: UK Government Standards
IT-Grundschutz [Germany]: Baseline Protection Methodology
9.
Most Comprehensive and Open
Data, visualization, engagement, & machine learning: Dashboards, Digital User Engagement, Predictive Analytics
Analytics: Data Warehousing, Big Data Processing, Interactive Query, Operational Analytics, Real-time Analytics, Serverless Data Processing
Data lake infrastructure & management: Infrastructure, Data Catalog & ETL, Security & Management
Data movement: Migration & Streaming Services
10.
Most Scalable, Cost-Effective, Performant Infrastructure
Five highly available storage tiers, including intelligent tiering
Industry-leading choice of 200+ instance types to meet workload needs
On-demand, Reserved, and Spot instances to reduce costs
100 Gbps bandwidth network interfaces for performance
11.
Amazon Managed Services for Data Science
12.
Amazon SageMaker Notebooks – Web-Based Environment
• Compatible with Jupyter and JupyterLab Notebooks
Access your notebooks in seconds
Administrators manage access and permissions
Share notebooks with a single click
Dial up or down compute resources
Start your notebooks without spinning up compute resources
13.
Amazon S3 – Data Lake
• Massively Scalable Object Storage
• 99.999999999% Durability (11 9’s)
• Global Replication
• Cost-Effective Storage Options
• Many Partner Integrations
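To make the eleven-nines durability figure concrete, here is a minimal arithmetic sketch. It assumes durability can be read as an independent per-object annual survival probability, which is a simplification for illustration and not a description of how S3 is engineered. The function name and the object count are hypothetical.

```python
from fractions import Fraction

# Eleven nines: 99.999999999% annual durability per object.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(num_objects: int,
                           durability: Fraction = DURABILITY) -> Fraction:
    """Expected objects lost per year, assuming independent losses (a simplification)."""
    return num_objects * (1 - durability)


# Hypothetical data lake holding ten million objects.
losses = expected_annual_losses(10_000_000)
years_per_loss = 1 / losses

print(f"Expected losses per year: {float(losses):.4f}")
print(f"Roughly one object lost every {int(years_per_loss):,} years")
```

Exact fractions are used instead of floats so the tiny loss probability is not rounded away.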
14.
Amazon Athena – Data Queries
• Serverless, Interactive Query Service
• Dynamically Scalable to Large Workloads
Query instantly: zero setup cost; point to S3 and start querying
Pay per query: pay only for queries run; save 30–90% on per-query costs through compression; use S3 storage
SQL: ANSI SQL; JDBC/ODBC drivers; multiple formats, compression types, and complex joins and data types
Easy: serverless, with zero infrastructure and zero administration; integrated with QuickSight
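The pay-per-query savings from compression can be sketched with simple arithmetic. The price per terabyte scanned, the decimal terabyte size, the 2 TB table, and the 4:1 compression ratio below are all assumptions chosen for illustration; check current Athena pricing before relying on any of them.

```python
PRICE_PER_TB_USD = 5.00   # assumed price per TB scanned, not taken from the slides
BYTES_PER_TB = 10**12     # assumed decimal terabyte


def query_cost_usd(bytes_scanned: int,
                   price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of one query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


raw_bytes = 2 * 10**12              # hypothetical 2 TB of uncompressed text
compressed_bytes = raw_bytes // 4   # hypothetical 4:1 compression ratio

raw_cost = query_cost_usd(raw_bytes)
compressed_cost = query_cost_usd(compressed_bytes)
savings = 1 - compressed_cost / raw_cost

print(f"Uncompressed scan: ${raw_cost:.2f}")
print(f"Compressed scan:   ${compressed_cost:.2f}")
print(f"Savings:           {savings:.0%}")
```

Because the charge scales with bytes scanned, any format change that reduces the scan (compression, columnar storage, partitioning) reduces the per-query cost by the same proportion.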
15.
Amazon Redshift – Data Warehouse
• Most Popular Cloud Data Warehouse
Best performance, most scalable: 3x faster with RA3*; 10x faster with AQUA*; adds unlimited compute capacity on-demand to meet unlimited concurrent access
Lowest cost: cost-optimized workloads by paying for compute and storage separately; 1/10th the cost of a traditional DW at $1000/TB/year; up to 75% less than other cloud data warehouses, with predictable costs
Data lake & AWS integration: analyze exabytes of data across data warehouse, data lakes, and operational databases; query data across various analytics services
Most secure & compliant: AWS-grade security (e.g., VPC, encryption with KMS, CloudTrail); all major certifications such as SOC, PCI DSS, ISO, FedRAMP, HIPAA
*vs. other cloud DWs
16.
Amazon SageMaker Processing Jobs – Data Processing
• Large-Scale Data Processing
• Supports Apache Spark
Use SageMaker’s built-in containers or bring your own
Custom processing: bring your own script for feature engineering
Achieve distributed processing for clusters
Your resources are created, configured, & terminated automatically
Leverage SageMaker’s security & compliance features
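A "bring your own script" step might look like the sketch below. It is a standard-library-only illustration, not the workshop's actual script: the container paths are the commonly used `/opt/ml/processing/...` locations and depend on how the job is configured, and the two derived features (`review_length`, `is_positive`) are hypothetical. The dry run at the bottom uses a temporary directory in place of the container paths.

```python
import csv
import tempfile
from pathlib import Path

# Assumed container paths; the real locations depend on the job configuration.
DEFAULT_INPUT_DIR = Path("/opt/ml/processing/input")
DEFAULT_OUTPUT_DIR = Path("/opt/ml/processing/output")

NEW_FIELDS = ["review_length", "is_positive"]  # hypothetical features


def add_features(row: dict) -> dict:
    """Derive two illustrative features from one review record."""
    out = dict(row)
    out["review_length"] = len(row.get("review_body") or "")
    out["is_positive"] = int(int(row["star_rating"]) >= 4)
    return out


def process(input_dir: Path = DEFAULT_INPUT_DIR,
            output_dir: Path = DEFAULT_OUTPUT_DIR) -> int:
    """Transform every .tsv file in input_dir and return the number of rows written."""
    output_dir.mkdir(parents=True, exist_ok=True)
    rows_written = 0
    for src in sorted(input_dir.glob("*.tsv")):
        dst = output_dir / src.name
        with src.open(newline="", encoding="utf-8") as fin, \
                dst.open("w", newline="", encoding="utf-8") as fout:
            reader = csv.DictReader(fin, delimiter="\t")
            fields = list(reader.fieldnames or []) + NEW_FIELDS
            writer = csv.DictWriter(fout, fieldnames=fields,
                                    delimiter="\t", lineterminator="\n")
            writer.writeheader()
            for row in reader:
                writer.writerow(add_features(row))
                rows_written += 1
    return rows_written


# Local dry run: a temporary directory stands in for the container paths.
with tempfile.TemporaryDirectory() as tmp:
    in_dir, out_dir = Path(tmp, "input"), Path(tmp, "output")
    in_dir.mkdir()
    (in_dir / "sample.tsv").write_text(
        "star_rating\treview_body\n5\tGreat product\n2\tBroke fast\n",
        encoding="utf-8",
    )
    rows = process(in_dir, out_dir)
    output_text = (out_dir / "sample.tsv").read_text(encoding="utf-8")

print(rows)
print(output_text)
```

Keeping the transformation in a pure function (`add_features`) makes the script easy to test locally before it is handed to a managed job.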
17.
AWS Open Source Libraries
AWS Data Wrangler
Simplify data querying and processing in Python and Pandas
• https://github.com/awslabs/aws-data-wrangler
• https://aws-data-wrangler.readthedocs.io/en/latest/
AWS Deequ
Data Quality Checks for Your Pipelines
• https://github.com/awslabs/deequ
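The kind of checks a data quality step runs can be sketched in plain Python. This is not Deequ's API (Deequ runs on Apache Spark); it is a stand-in that shows completeness, uniqueness, and range constraints on a few hypothetical review records. All function names and sample rows are invented for illustration.

```python
def completeness(rows: list, column: str) -> float:
    """Fraction of rows whose value in `column` is neither None nor empty."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column) not in (None, ""))
    return filled / len(rows)


def is_unique(rows: list, column: str) -> bool:
    """True when no value in `column` appears more than once."""
    values = [r.get(column) for r in rows]
    return len(values) == len(set(values))


def in_range(rows: list, column: str, low: int, high: int) -> bool:
    """True when every value in `column` lies in [low, high]."""
    return all(low <= r[column] <= high for r in rows)


# Hypothetical records with deliberate problems: a duplicate id,
# an out-of-range rating, and an empty review body.
sample = [
    {"review_id": "R1", "star_rating": 5, "review_body": "Great"},
    {"review_id": "R2", "star_rating": 1, "review_body": ""},
    {"review_id": "R2", "star_rating": 7, "review_body": "Duplicate id"},
]

report = {
    "review_id is complete": completeness(sample, "review_id") == 1.0,
    "review_id is unique": is_unique(sample, "review_id"),
    "star_rating in [1, 5]": in_range(sample, "star_rating", 1, 5),
    "review_body completeness >= 0.9": completeness(sample, "review_body") >= 0.9,
}

for name, passed in report.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

On a real pipeline the same constraints would be declared once and evaluated by the data quality library across the full dataset.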
18.
DEMOs!
https://github.com/data-science-on-aws/workshop
Amazon Customer Reviews Dataset
(150+ Million Reviews)
https://s3.amazonaws.com/amazon-reviews-pds/readme.html
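Registering the reviews dataset in Athena and querying it could be sketched as follows. The column list and the `s3://amazon-reviews-pds/tsv/` location are assumptions based on the public dataset's readme rather than on these slides, and the database name, table name, and results bucket are hypothetical. `boto3` is imported lazily inside `start_query`, so the SQL builders run without AWS credentials or the SDK installed.

```python
# Assumed schema of the tab-separated files (not stated in the slides).
COLUMNS = [
    ("marketplace", "string"), ("customer_id", "string"),
    ("review_id", "string"), ("product_id", "string"),
    ("product_parent", "string"), ("product_title", "string"),
    ("product_category", "string"), ("star_rating", "int"),
    ("helpful_votes", "int"), ("total_votes", "int"),
    ("vine", "string"), ("verified_purchase", "string"),
    ("review_headline", "string"), ("review_body", "string"),
    ("review_date", "string"),
]


def create_table_sql(database: str, table: str, s3_location: str) -> str:
    """Build DDL for an external table over gzip-compressed TSV files."""
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in COLUMNS)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n  {cols}\n)\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
        "LINES TERMINATED BY '\\n'\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('compressionType'='gzip', 'skip.header.line.count'='1')"
    )


def avg_rating_sql(database: str, table: str) -> str:
    """Build a query for review count and average star rating per category."""
    return (
        "SELECT product_category, COUNT(*) AS num_reviews, "
        "AVG(star_rating) AS avg_star_rating\n"
        f"FROM {database}.{table}\n"
        "GROUP BY product_category\n"
        "ORDER BY avg_star_rating DESC"
    )


def start_query(sql: str, database: str, output_location: str) -> str:
    """Submit a query to Athena and return its execution id (needs AWS credentials)."""
    import boto3  # lazy import: only needed when actually calling AWS

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical names; nothing is sent to AWS here.
ddl = create_table_sql("dsoaws", "amazon_reviews_tsv", "s3://amazon-reviews-pds/tsv/")
query = avg_rating_sql("dsoaws", "amazon_reviews_tsv")
print(ddl)
print(query)
```

To run against a real account, `start_query(ddl, "dsoaws", "s3://<your-results-bucket>/athena/")` would submit the DDL, with the bucket placeholder replaced by one you own.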
19.
Thank you!
Chris Fregly @cfregly
https://github.com/data-science-on-aws/workshop
https://linkedin.com/in/cfregly/