Data Science on AWS - Collision Conference - June 2020

https://collisionconf.com/

Collision from Home | June 23-25, 2020 | Join us from your living room. Collision from Home brings together the people and companies redefining the global tech industry. Think of it like working from home: participants will join the event online from their living room.

Agenda
* Why Choose AWS for Data Science?
* Amazon Managed Services for Data Science
* Analyze Amazon Customer Reviews Dataset (DEMO)
* Ingest S3 Data with Amazon Athena and Redshift (DEMO)
* Data Analysis with Pandas, Matplotlib, and Amazon SageMaker Notebooks (DEMO)
* Data Quality Checks with Apache Spark and Amazon SageMaker Processing Jobs (DEMO)

Published in: Software
Data Science on AWS - Collision Conference - June 2020

  1. Data Science on AWS. Chris Fregly, Developer Advocate, AI/ML, Amazon Web Services. © 2020, Amazon Web Services, Inc. or its Affiliates.
  2. Who Am I?
     • Based in San Francisco
     • Meetup Organizer, Advanced SageMaker and Kubeflow Meetup: https://meetup.com/Advanced-Kubeflow/ (12,000+ Members)
     • O’Reilly Author, Data Science on AWS: https://datascienceonaws.com
     • Former Engineer at Netflix (Video Streaming) and Databricks (Spark Streaming)
  3. Agenda
     • Why Choose AWS for Data Science?
     • Amazon Managed Services for Data Science
     • DEMOs!
       • Analyze Amazon Customer Reviews Dataset
       • Ingest S3 Data with Amazon Athena and Redshift
       • Data Analysis with Pandas, Matplotlib, and Amazon SageMaker Notebooks
       • Data Quality Checks with Apache Spark and Amazon SageMaker Processing Jobs
  4. Why Choose AWS for Data Science?
  5. Why Choose AWS for Data Science?
     • Most secure infrastructure and certifications
     • Most scalable and cost-effective options
     • Easiest to build data science solutions
     • Most comprehensive and open
  6. Easiest to Build Data Science Solutions
     • Build secure data lakes in days (vs. months or years)
     • A single storage layer (S3) for all analytics and ML
     • Deep integration across all AWS analytics and machine learning services, including federated queries across different services
     • The fastest way to go from Zero to Business Insights, covering all data for all users
  7. Most Secure Infrastructure
     • Customers need to have multiple levels of security, identity and access management, encryption, and compliance to secure their data lake
     • Compliance: AWS Artifact, Amazon Inspector, AWS CloudHSM, Amazon Cognito, AWS CloudTrail
     • Security: Amazon GuardDuty, AWS Shield, AWS WAF, Amazon Macie, VPC
     • Encryption: AWS Certificate Manager, AWS Key Management Service, encryption at rest, encryption in transit, bring your own keys, HSM support
     • Identity: AWS IAM, AWS SSO, Amazon Cloud Directory, AWS Directory Service, AWS Organizations
  8. Most Certifications
     • Global: CSA (Cloud Security Alliance Controls); ISO 9001 (Global Quality Standard); ISO 27001 (Security Management Controls); ISO 27017 (Cloud Specific Controls); ISO 27018 (Personal Data Protection); PCI DSS Level 1 (Payment Card Standards); SOC 1 (Audit Controls Report); SOC 2 (Security, Availability, & Confidentiality Report); SOC 3 (General Controls Report)
     • United States: CJIS (Criminal Justice Information Services); DoD SRG (DoD Data Processing); FedRAMP (Government Data Standards); FERPA (Educational Privacy Act); FIPS (Government Security Standards); FISMA (Federal Information Security Management); GxP (Quality Guidelines and Regulations); FFIEC (Financial Institutions Regulation); HIPAA (Protected Health Information); ITAR (International Arms Regulations); MPAA (Protected Media Content); NIST (National Institute of Standards and Technology); SEC Rule 17a-4(f) (Financial Data Standards); VPAT/Section 508 (Accountability Standards)
     • Asia Pacific: FISC [Japan] (Financial Industry Information Systems); IRAP [Australia] (Australian Security Standards); K-ISMS [Korea] (Korean Information Security); MTCS Tier 3 [Singapore] (Multi-Tier Cloud Security Standard); My Number Act [Japan] (Personal Information Protection)
     • Europe: C5 [Germany] (Operational Security Attestation); Cyber Essentials Plus [UK] (Cyber Threat Protection); G-Cloud [UK] (UK Government Standards); IT-Grundschutz [Germany] (Baseline Protection Methodology)
  9. Most Comprehensive and Open
     • Data movement: Migration & Streaming Services
     • Data lake infrastructure & management: Infrastructure; Data Catalog & ETL; Security & Management
     • Analytics: Data Warehousing; Big Data Processing; Interactive Query; Operational Analytics; Real-time Analytics; Serverless Data Processing
     • Data, visualization, engagement, & machine learning: Dashboards; Predictive Analytics; Digital User Engagement; Data
  10. Most Scalable, Cost-Effective, Performant Infrastructure
      • Five highly available storage tiers including intelligent tiering
      • Industry-leading choice of 200+ instance types to meet workload needs
      • On-demand, Reserved, and Spot instances to reduce costs
      • 100 Gbps bandwidth network interfaces for performance
  11. Amazon Managed Services for Data Science
  12. Amazon SageMaker Notebooks – Web-Based Environment
      • Compatible with Jupyter and JupyterLab Notebooks
      • Access your notebooks in seconds
      • Administrators manage access and permissions
      • Share notebooks with a single click
      • Dial up or down compute resources
      • Start your notebooks without spinning up compute resources
  13. Amazon S3 – Data Lake
      • Massively Scalable Object Storage
      • 99.999999999% Durability (11 9’s)
      • Global Replication
      • Cost-Effective Storage Options
      • Many Partner Integrations
  14. Amazon Athena – Data Queries
      • Serverless, Interactive Query Service
      • Dynamically Scalable to Large Workloads
      • Pay per query: pay only for queries run; save 30–90% on per-query costs through compression; use S3 storage
      • SQL: ANSI SQL; JDBC/ODBC drivers; multiple formats, compression types, and complex joins and data types
      • Easy: serverless (zero infrastructure, zero administration); integrated with QuickSight
      • Query instantly: zero setup cost; point to S3 and start querying
  15. Amazon Redshift – Data Warehouse
      • Most Popular Cloud Data Warehouse
      • Best performance, most scalable: 3x faster with RA3*; 10x faster with AQUA*; adds unlimited compute capacity on demand to meet unlimited concurrent access
      • Lowest cost: cost-optimized workloads by paying for compute and storage separately; 1/10th the cost of a traditional DW at $1000/TB/year; up to 75% less than other cloud data warehouses, with predictable costs
      • Data lake & AWS integration: analyze exabytes of data across data warehouse, data lakes, and operational databases; query data across various analytics services
      • Most secure & compliant: AWS-grade security (e.g., VPC, encryption with KMS, CloudTrail); all major certifications such as SOC, PCI DSS, ISO, FedRAMP, HIPAA
      *vs. other cloud DWs
  16. Amazon SageMaker Processing Jobs – Data Processing
      • Large-Scale Data Processing
      • Supports Apache Spark
      • Use SageMaker’s built-in containers or bring your own
      • Bring your own script for feature engineering
      • Custom processing
      • Achieve distributed processing for clusters
      • Your resources are created, configured, & terminated automatically
      • Leverage SageMaker’s security & compliance features
  17. AWS Open Source Libraries
      • AWS Data Wrangler: Simplify data querying and processing in Python and Pandas
        https://github.com/awslabs/aws-data-wrangler
        https://aws-data-wrangler.readthedocs.io/en/latest/
      • AWS Deequ: Data Quality Checks for Your Pipelines
        https://github.com/awslabs/deequ
  18. DEMOs!
      • https://github.com/data-science-on-aws/workshop
      • Amazon Customer Reviews Dataset (150+ Million Reviews): https://s3.amazonaws.com/amazon-reviews-pds/readme.html
  19. Thank you! Chris Fregly, @cfregly
      • https://github.com/data-science-on-aws/workshop
      • https://linkedin.com/in/cfregly/
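
Slide 13 quotes eleven nines of durability for Amazon S3. A small worked calculation may help make that figure concrete. The sketch below assumes the figure is an annual, per-object probability and that losses are independent; the bucket size of 10 million objects is a hypothetical input, not a number from the slides.

```python
from fractions import Fraction

# Eleven nines of durability (99.999999999%), held as an exact fraction
# so the arithmetic is not affected by floating-point rounding.
DURABILITY = Fraction(99_999_999_999, 100_000_000_000)


def expected_annual_losses(object_count: int) -> Fraction:
    """Expected number of objects lost per year (assumes independent losses)."""
    return object_count * (1 - DURABILITY)


def years_per_single_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_annual_losses(object_count)


objects = 10_000_000  # hypothetical bucket size
print(float(expected_annual_losses(objects)))  # 0.0001 objects per year
print(years_per_single_loss(objects))          # 10000 years per lost object
```

Under these assumptions, storing 10 million objects corresponds to an expected loss of one object roughly every 10,000 years.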
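
Slide 14 says Athena charges only for queries run and that compression can cut per-query cost by 30–90%. The sketch below illustrates that arithmetic. The price of $5 per TB scanned, the 10 MB per-query minimum, and the binary definition of a terabyte are assumptions based on Athena's published pricing at the time and should be checked against the current price list. The 4:1 compression ratio is a hypothetical input.

```python
# Assumed pricing inputs; verify against the current Athena price list.
PRICE_PER_TB_USD = 5.00
BYTES_PER_TB = 1024 ** 4
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2  # assumed 10 MB minimum charge per query


def query_cost_usd(bytes_scanned: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimate the cost of one query from the bytes it scans."""
    billable = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable / BYTES_PER_TB * price_per_tb


def compressed_cost_usd(raw_bytes: float, compression_ratio: float) -> float:
    """Estimate the cost when the same data is stored compressed.

    A compression_ratio of 4.0 means the files are four times smaller.
    """
    return query_cost_usd(raw_bytes / compression_ratio)


raw = 1 * BYTES_PER_TB                # a query that scans 1 TB uncompressed
full = query_cost_usd(raw)            # 5.0
gz = compressed_cost_usd(raw, 4.0)    # 1.25, with a hypothetical 4:1 ratio
saving = 1 - gz / full                # 0.75
print(f"uncompressed ${full:.2f}, compressed ${gz:.2f}, saving {saving:.0%}")
```

Because the charge depends on bytes scanned, anything that shrinks the scan (compression, columnar formats, partitioning) lowers the bill in proportion.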
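
Slides 14 and 18 together suggest pointing Athena at the Amazon Customer Reviews Dataset in S3. One way to do that is an external table definition over the tab-separated files, sketched below as a Python helper that only builds the SQL text. The database name `dsoaws`, the table name `amazon_reviews_tsv`, and the `s3://amazon-reviews-pds/tsv/` prefix are assumptions for illustration. The column list reflects the dataset readme as understood here and should be verified against it before use.

```python
# Column names and types assumed from the dataset readme; verify before use.
REVIEW_COLUMNS = [
    ("marketplace", "string"),
    ("customer_id", "string"),
    ("review_id", "string"),
    ("product_id", "string"),
    ("product_parent", "string"),
    ("product_title", "string"),
    ("product_category", "string"),
    ("star_rating", "int"),
    ("helpful_votes", "int"),
    ("total_votes", "int"),
    ("vine", "string"),
    ("verified_purchase", "string"),
    ("review_headline", "string"),
    ("review_body", "string"),
    ("review_date", "string"),
]


def create_table_sql(database: str, table: str, s3_location: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for tab-separated files."""
    columns = ",\n".join(f"  {name} {dtype}" for name, dtype in REVIEW_COLUMNS)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"{columns}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'\n"
        "LINES TERMINATED BY '\\n'\n"
        f"LOCATION '{s3_location}'\n"
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )


# Hypothetical database/table names and an assumed S3 prefix.
sql = create_table_sql("dsoaws", "amazon_reviews_tsv", "s3://amazon-reviews-pds/tsv/")
print(sql)
```

The generated statement can be pasted into the Athena query editor or submitted through a client library. No data is copied, since the table only describes files that stay in S3.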
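
Slides 16 and 17 describe bringing your own script to a SageMaker Processing Job and running data quality checks in a pipeline. The sketch below shows the general shape of such a script in plain Python. It is not the Deequ API (Deequ runs on Apache Spark); it only mimics two of the ideas, completeness and uniqueness. The container paths are the conventional Processing Job locations and are an assumption here, since a job can map inputs and outputs elsewhere. The report file name and the default column names are hypothetical.

```python
import csv
import json
from pathlib import Path

# Conventional Processing Job paths (assumed; a job may configure others).
DEFAULT_INPUT = Path("/opt/ml/processing/input")
DEFAULT_OUTPUT = Path("/opt/ml/processing/output")


def check_rows(rows, required, unique_key):
    """Compute completeness per required column and uniqueness of one key."""
    total = 0
    non_empty = {column: 0 for column in required}
    seen = set()
    for row in rows:
        total += 1
        for column in required:
            if (row.get(column) or "").strip():
                non_empty[column] += 1
        seen.add(row.get(unique_key))
    return {
        "row_count": total,
        "completeness": {
            column: (count / total if total else 0.0)
            for column, count in non_empty.items()
        },
        "unique": len(seen) == total,
    }


def run_checks(input_dir, output_dir,
               required=("review_id", "star_rating"), unique_key="review_id"):
    """Read every *.tsv file in input_dir and write a JSON quality report."""
    rows = []
    for path in sorted(Path(input_dir).glob("*.tsv")):
        with path.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
            rows.extend(reader)
    report = check_rows(rows, required, unique_key)
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    # Hypothetical report file name.
    (Path(output_dir) / "quality_report.json").write_text(json.dumps(report, indent=2))
    return report


# Run only when executed as a script inside a container that has the input path.
if __name__ == "__main__" and DEFAULT_INPUT.is_dir():
    print(run_checks(DEFAULT_INPUT, DEFAULT_OUTPUT))
```

Keeping the checks in a function that takes directories as arguments makes the same script testable on a laptop before it is submitted as a job.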
