Scaling Pandas on AWS
Let me count the ways!
Chris Fregly, Principal Solution Architect @ AWS
Agenda
Updates from re:Invent 2022 (this week!)
Amazon CodeWhisperer
Overview of AWS services for data, AI and machine learning
AWS SDK for Pandas
Serverless Ray on AWS
Updates from AWS re:Invent 2022 (this week!)
Focused on SageMaker usability, collaboration, and notebook-as-jobs
First-class support for large language and generative models like Stable Diffusion
Introduces Serverless Ray (ray.io) to AWS services including SageMaker and Glue
Amazon CodeWhisperer
AWS SDK for Pandas
Modin
Pandas
Spark
Ray
…
Everything!
DEMOs!
Quick overview of AWS services for data and AI/ML
AI and machine learning
Data and analytics
Let’s search the internet for “Pandas on AWS”
AWS SDK for Pandas - Python library featuring Modin!
Captures AWS best practices including concurrent reads/writes and security
Developed and maintained by AWS Professional Services and Solutions Architects
Allows Pandas to scale to the cloud!
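A rough sketch of what this looks like in code, assuming the library is installed as `awswrangler` and AWS credentials are already configured. The bucket name, prefix, and partition column are hypothetical placeholders, not values from the talk.

```python
def s3_dataset_path(bucket: str, prefix: str) -> str:
    """Build an s3:// URI for a dataset folder (bucket and prefix are hypothetical)."""
    return f"s3://{bucket.strip('/')}/{prefix.strip('/')}/"


def roundtrip_parquet(df, path: str, partition_col: str):
    """Write a DataFrame to S3 as a partitioned Parquet dataset, then read it back.

    Needs the awswrangler package and valid AWS credentials, so the import
    is kept inside the function and nothing here runs at import time.
    """
    import awswrangler as wr  # AWS SDK for Pandas

    wr.s3.to_parquet(
        df=df,
        path=path,
        dataset=True,                    # write a folder of files, not one object
        partition_cols=[partition_col],  # one S3 prefix per distinct value
        mode="overwrite",
    )
    # use_threads=True lets the library issue S3 requests concurrently.
    return wr.s3.read_parquet(path=path, dataset=True, use_threads=True)


path = s3_dataset_path("my-example-bucket", "reviews/parquet")
print(path)
```

Calling `roundtrip_parquet(df, path, "year")` on a DataFrame with a `year` column would then write and re-read the dataset against a real bucket.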
Why does scale matter?
Big data is, well, big! Requires a lot of RAM to analyze
Data that doesn’t fit into a single server’s RAM needs to run on a cluster
Preferably a dedicated, serverless cluster
Dedicated cluster avoids contention with other users/jobs (long-running)
Serverless reduces cost because we pay for only what we use - less idle time
Fortunately, AWS has innovated on fast-start (<1 sec) dedicated serverless clusters
Based on Firecracker open source project
https://firecracker-microvm.github.io
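To make the RAM point concrete, here is a back-of-the-envelope estimate in plain Python. The row count, column count, 8 bytes per value, and the 3x working-memory headroom are all illustrative assumptions, not measurements.

```python
import math

GIB = 1024 ** 3


def estimated_bytes(rows: int, columns: int, bytes_per_value: int = 8) -> int:
    """Approximate in-memory size of a purely numeric table."""
    return rows * columns * bytes_per_value


def nodes_needed(data_bytes: int, ram_per_node_bytes: int, headroom: float = 3.0) -> int:
    """Nodes required if each operation needs `headroom` times the data size.

    The headroom multiplier is an assumption: intermediate copies during
    joins and group-bys often need several times the raw data size.
    """
    return max(1, math.ceil(data_bytes * headroom / ram_per_node_bytes))


data = estimated_bytes(rows=2_000_000_000, columns=15)
print(f"{data / GIB:.1f} GiB in memory")
print(nodes_needed(data, ram_per_node_bytes=64 * GIB), "nodes with 64 GiB each")
```

Once the node count goes above one, the data no longer fits on a single server and the work has to move to a cluster.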
Which AWS services are supported by the AWS SDK for Pandas?
● Amazon S3
● AWS Glue Catalog
● Amazon Athena
● AWS Lake Formation
● Amazon Redshift
● PostgreSQL
● MySQL
● SQL Server
● Oracle
● Redshift Data API
● RDS Data API
● OpenSearch
● Amazon Neptune
● DynamoDB
● Amazon Timestream
● Amazon EMR
● Amazon CloudWatch Logs
● Amazon Chime
● Amazon QuickSight
● AWS STS
● AWS Secrets Manager
● Global Configurations
Where can I run the AWS SDK for Pandas library?
Local laptop - limited by RAM
AWS Glue notebooks/jobs, interactive serverless clusters
Amazon SageMaker Studio notebooks/jobs, interactive serverless clusters through Glue
Amazon Elastic MapReduce (EMR) Studio notebooks/jobs - serverless clusters (2021)
Lots of tutorials for the AWS SDK for Pandas library
● 001 - Introduction
● 002 - Sessions
● 003 - Amazon S3
● 004 - Parquet Datasets
● 005 - Glue Catalog
● 006 - Amazon Athena
● 007 - Databases (Redshift, MySQL, PostgreSQL)
● 008 - Redshift - Copy & Unload
● 009 - Redshift - Append, Overwrite and Upsert
● 010 - Parquet Crawler
● 011 - CSV Datasets
● 012 - CSV Crawler
● 013 - Merging Datasets on S3
● 014 - Schema Evolution
● 015 - EMR
● 016 - EMR & Docker
● 017 - Partition Projection
● 018 - QuickSight
● 019 - Athena Cache
● 020 - Spark Table Interoperability
● 021 - Global Configurations
● 022 - Writing Partitions Concurrently
● 023 - Flexible Partitions Filter
● 024 - Athena Query Metadata
● 025 - Redshift - Loading Parquet files with Spectrum
● 026 - Amazon Timestream
● 027 - Amazon Timestream 2
● 028 - Amazon DynamoDB
● 029 - S3 Select
● 030 - Data API
● 031 - OpenSearch
● 032 - Lake Formation Governed Tables
● 033 - Amazon Neptune
DEMOs!
Why Ray on AWS?
Used by Amazon.com for some data-intensive use cases
Better performance than Apache Spark in some cases
Customers are asking for unified Ray API for both data and AI/ML workloads
Serverless Ray on AWS
Scalable data transformations through Ray Datasets
Scalable AI and machine learning through Ray AI Runtime (AIR)
Serverless clusters through SageMaker + Glue Interactive Sessions integration
Which AWS services support Ray?
Scaling Ray from laptop to cluster
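A minimal sketch of the laptop-to-cluster switch, assuming the `ray` package is installed. `ray.init()` with no arguments starts Ray locally, while passing `address=...` connects to an existing cluster. The helper names and the example address are hypothetical.

```python
from typing import Optional


def ray_init_kwargs(cluster_address: Optional[str] = None) -> dict:
    """Arguments for ray.init(): none for a local run, address=... for a cluster."""
    if cluster_address is None:
        return {}  # ray.init() starts Ray on the local machine
    return {"address": cluster_address}


def connect(cluster_address: Optional[str] = None):
    """Start or join Ray. The import is deferred because it needs the ray package."""
    import ray

    return ray.init(**ray_init_kwargs(cluster_address))


print(ray_init_kwargs())                                          # laptop
print(ray_init_kwargs("ray://head-node.example.internal:10001"))  # cluster (hypothetical host)
```

The rest of the application code stays the same in both cases. Only the connection arguments change.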
Ray Datasets - group by and count
Modin - group by and count
Apache Spark - group by and count
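The code from these three slides did not survive in this transcript, so the following is a sketch of the same group-by-and-count in each API, not the code shown in the talk. The `product_category` column and the sample rows are made up. Each engine is imported inside its own function because it needs that package installed, and the result column name `count()` for Ray reflects the Ray 2.x behaviour as I understand it.

```python
from collections import Counter

records = [
    {"product_category": "Books"},
    {"product_category": "Music"},
    {"product_category": "Books"},
]


def count_plain(rows):
    """Reference answer in plain Python, useful for checking the engines."""
    return dict(Counter(row["product_category"] for row in rows))


def count_ray(rows):
    import ray  # Ray Datasets

    grouped = ray.data.from_items(rows).groupby("product_category").count()
    # Assumption: Ray 2.x names the aggregate column "count()".
    return {r["product_category"]: r["count()"] for r in grouped.take_all()}


def count_modin(rows):
    import modin.pandas as pd  # drop-in replacement for `import pandas as pd`

    df = pd.DataFrame(rows)
    return df.groupby("product_category").size().to_dict()


def count_spark(rows):
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    grouped = spark.createDataFrame(rows).groupBy("product_category").count()
    return {r["product_category"]: r["count"] for r in grouped.collect()}


print(count_plain(records))
```

The Modin version is the only one that reads exactly like Pandas, which is why it is the easiest path for existing Pandas code.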
Ray use cases - data processing
Large-scale data ingest and transform
Change data capture
Distributed shuffle
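The distributed shuffle bullet can be sketched in plain Python. This is only the idea (route every record to a partition chosen by hashing its key, so equal keys end up together), not how Ray implements it. The sample data and function names are made up.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Stable hash partitioning. crc32 is used because Python's built-in
    hash() for strings changes between processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(partitions, key_fn, num_partitions):
    """Redistribute records so every record with the same key shares a partition."""
    out = [[] for _ in range(num_partitions)]
    for part in partitions:
        for record in part:
            out[partition_for(key_fn(record), num_partitions)].append(record)
    return out


# Three input partitions, as if read by three different workers.
input_partitions = [
    [("Books", 1), ("Music", 1)],
    [("Books", 1), ("Toys", 1)],
    [("Music", 1), ("Books", 1)],
]
shuffled = shuffle(input_partitions, key_fn=lambda r: r[0], num_partitions=2)
for index, part in enumerate(shuffled):
    print(index, part)
```

After the shuffle, each output partition can be aggregated independently, which is what makes group-by and join operations parallel.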
Ray use cases - AI and machine learning
Fast “last-mile” data loading to improve utilization of model-training resources
Automated machine learning (AutoML) - find best model and tuning parameters
Hyper-parameter tuning - find best tuning parameters for a given model
Reinforcement learning - learn from repeated actions and results
Model-ensemble predictions
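As an illustration of the hyper-parameter tuning use case, here is a tiny grid search in plain Python that evaluates candidates in parallel. It uses a thread pool and a toy loss function as stand-ins. On Ray, the trials would be spread across a cluster instead. The parameter names and the loss formula are invented for the example.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor


def validation_loss(learning_rate: float, depth: int) -> float:
    """Toy stand-in for 'train a model and score it'. Lowest at lr=0.1, depth=6."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 6) ** 2


def grid_search(grid: dict):
    """Evaluate every combination in the grid and return (best_params, best_loss)."""
    names = list(grid)
    candidates = [
        dict(zip(names, values)) for values in itertools.product(*grid.values())
    ]
    # Each trial is independent, so the trials can run concurrently.
    with ThreadPoolExecutor(max_workers=4) as pool:
        losses = list(pool.map(lambda params: validation_loss(**params), candidates))
    best = min(range(len(candidates)), key=losses.__getitem__)
    return candidates[best], losses[best]


search_space = {"learning_rate": [0.01, 0.1, 0.3], "depth": [3, 6, 9]}
best_params, best_loss = grid_search(search_space)
print(best_params, best_loss)
```

Because the trials do not depend on each other, this kind of workload scales almost linearly with the number of workers.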
AWS SDK for Pandas uses Modin for distributed Pandas!
Debugging and performance tips: https://github.com/aws/aws-sdk-pandas/discussions/1815#common-errors
Lots of Ray tutorials
https://github.com/aws-samples/aws-samples-for-ray
https://docs.ray.io/en/latest/ray-core/examples/overview.html
DEMOs!
Scaling Pandas on AWS
Cheers!
Chris Fregly, Principal Solution Architect @ AWS
EXTRAS
High-memory instance types on AWS
