Pandas on AWS - Let me count the ways.pdf

Scaling Pandas on AWS
Let me count the ways!
Chris Fregly, Principal Solution Architect @ AWS

Agenda
Updates from re:Invent 2022 (this week!)
Amazon CodeWhisperer
Overview of AWS services for data, AI and machine learning
AWS SDK for Pandas
Serverless Ray on AWS

Updates from AWS re:Invent 2022 (this week!)
Focused on SageMaker usability, collaboration, and notebook-as-jobs
First-class support for large language and generative models like Stable Diffusion
Introduces Serverless Ray (ray.io) to AWS services including SageMaker and Glue

Amazon CodeWhisperer
AWS SDK for Pandas
Modin
Pandas
Spark
Ray
…
Everything!

Quick overview of AWS services for data and AI/ML
AI and machine learning
Data and analytics

Let’s search the internet for “Pandas on AWS”

AWS SDK for Pandas - Python library featuring Modin!
Captures AWS best practices including concurrent reads/writes and security
Developed and maintained by AWS Professional Services and Solution Architects
Allows Pandas to scale to the cloud!
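A minimal sketch of what using the library can look like, assuming the `awswrangler` package (the import name of the AWS SDK for Pandas) is installed and AWS credentials are configured. The bucket and prefix names are hypothetical, and the import is done lazily so the helper can be used without the library present.

```python
def s3_dataset_path(bucket: str, prefix: str) -> str:
    """Build an s3:// URI for a dataset folder (always ends with '/')."""
    return f"s3://{bucket}/{prefix.strip('/')}/"


def round_trip(df, path: str):
    """Write a DataFrame to S3 as a Parquet dataset, then read it back."""
    import awswrangler as wr  # pip install awswrangler; imported lazily

    # dataset=True writes a folder of Parquet files rather than one object.
    wr.s3.to_parquet(df=df, path=path, dataset=True, mode="overwrite")
    # use_threads=True lets the library fetch the objects concurrently.
    return wr.s3.read_parquet(path=path, dataset=True, use_threads=True)


# Hypothetical location, for illustration only.
EXAMPLE_PATH = s3_dataset_path("my-bucket", "/sales/2022/")
```

With credentials in place, `round_trip(df, EXAMPLE_PATH)` would write a Pandas DataFrame to that location and return the same data read back from S3.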

Why does scale matter?
Big data is, well, big! Requires a lot of RAM to analyze
Data that doesn’t fit into a single server’s RAM needs to be processed on a cluster
Preferably a dedicated, serverless cluster
Dedicated cluster avoids contention with other users/jobs (long-running)
Serverless reduces cost because we pay for only what we use - less idle time
Fortunately, AWS has innovated on fast-start (<1 sec) dedicated serverless clusters
Based on Firecracker open source project
https://firecracker-microvm.github.io
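A back-of-the-envelope sketch of the "does it fit in RAM?" question. The 8 bytes per value assumes 64-bit numeric columns, and the 3x headroom factor is an assumed rule of thumb for intermediate copies made during analysis, not a figure from this talk.

```python
def estimated_gib(rows: int, numeric_columns: int, bytes_per_value: int = 8) -> float:
    """Rough in-memory size of a purely numeric table, in GiB."""
    return rows * numeric_columns * bytes_per_value / 2**30


def fits_in_ram(rows: int, numeric_columns: int, ram_gib: float,
                headroom: float = 3.0) -> bool:
    """True if the table, times an assumed headroom factor, fits in ram_gib."""
    return estimated_gib(rows, numeric_columns) * headroom <= ram_gib


# One billion rows of 20 numeric columns is roughly 149 GiB before any headroom,
# far more than a typical laptop can hold.
big_table_gib = estimated_gib(1_000_000_000, 20)
```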

Which AWS services are supported by the AWS SDK for Pandas?
● Amazon S3
● AWS Glue Catalog
● Amazon Athena
● AWS Lake Formation
● Amazon Redshift
● PostgreSQL
● MySQL
● SQL Server
● Oracle
● Data API Redshift
● Data API RDS
● OpenSearch
● Amazon Neptune
● DynamoDB
● Amazon Timestream
● Amazon EMR
● Amazon CloudWatch Logs
● Amazon Chime
● Amazon QuickSight
● AWS STS
● AWS Secrets Manager
● Global Configurations

Where can I run the AWS SDK for Pandas library?
Local laptop - limited by RAM
AWS Glue notebooks/jobs, interactive serverless clusters
Amazon SageMaker Studio notebooks/jobs, interactive serverless clusters through Glue
Amazon Elastic MapReduce (EMR) Studio notebooks/jobs - serverless clusters (2021)

Lots of tutorials for the AWS SDK for Pandas library
● 001 - Introduction
● 002 - Sessions
● 003 - Amazon S3
● 004 - Parquet Datasets
● 005 - Glue Catalog
● 006 - Amazon Athena
● 007 - Databases (Redshift, MySQL, PostgreSQL)
● 008 - Redshift - Copy & Unload
● 009 - Redshift - Append, Overwrite and Upsert
● 010 - Parquet Crawler
● 011 - CSV Datasets
● 012 - CSV Crawler
● 013 - Merging Datasets on S3
● 014 - Schema Evolution
● 015 - EMR
● 016 - EMR & Docker
● 017 - Partition Projection
● 018 - QuickSight
● 019 - Athena Cache
● 020 - Spark Table Interoperability
● 021 - Global Configurations
● 022 - Writing Partitions Concurrently
● 023 - Flexible Partitions Filter
● 024 - Athena Query Metadata
● 025 - Redshift - Loading Parquet files with Spectrum
● 026 - Amazon Timestream
● 027 - Amazon Timestream 2
● 028 - Amazon DynamoDB
● 029 - S3 Select
● 030 - Data API
● 031 - OpenSearch
● 032 - Lake Formation Governed Tables
● 033 - Amazon Neptune

Why Ray on AWS?
Used by Amazon.com for some data-intensive use cases
Better performance than Apache Spark in some cases
Customers are asking for unified Ray API for both data and AI/ML workloads

Serverless Ray on AWS
Scalable data transformations through Ray Datasets
Scalable AI and machine learning through Ray AI Runtime (AIR)
Serverless clusters through SageMaker + Glue Interactive Sessions integration

Which AWS services support Ray?

Scaling Ray from laptop to cluster
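A minimal sketch of the laptop-to-cluster idea, assuming the `ray` package is installed. The same task code runs in both places and only the address passed to `ray.init` changes. The cluster address shown in the docstring is a hypothetical placeholder.

```python
def square(x: int) -> int:
    """Plain Python function; Ray can run it as a remote task unchanged."""
    return x * x


def run_squares(values, address=None):
    """Run square() over values as parallel Ray tasks.

    address=None starts (or connects to) a Ray runtime on the local machine.
    Passing a cluster address (hypothetical: "ray://head-node:10001") sends
    the same tasks to a remote cluster instead.
    """
    import ray  # pip install ray; imported lazily

    ray.init(address=address, ignore_reinit_error=True)
    remote_square = ray.remote(square)  # wrap the function as a Ray task
    futures = [remote_square.remote(v) for v in values]
    return ray.get(futures)  # block until every task has finished
```

Calling `run_squares(range(4))` on a laptop and `run_squares(range(4), address=...)` against a cluster use identical task code.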

Ray Datasets - group by and count
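The code shown on this slide is not available in the text, so the following is a sketch of a group-by-and-count with Ray Datasets rather than the original example. It assumes the `ray` package with its data extras is installed. The sample rows and the `category` column are hypothetical, and a plain-Python version is included to show what the distributed job computes.

```python
from collections import Counter

# Hypothetical sample rows; the slide's original data is not available.
ROWS = [
    {"category": "books"},
    {"category": "games"},
    {"category": "books"},
]


def count_plain(rows):
    """Single-process reference: number of rows per category."""
    return dict(Counter(row["category"] for row in rows))


def count_with_ray(rows):
    """Same aggregation expressed with Ray Datasets."""
    import ray  # pip install "ray[data]"; imported lazily

    ds = ray.data.from_items(rows)  # build a distributed dataset from the rows
    grouped = ds.groupby("category").count()  # one output row per category
    return grouped.take_all()  # collect the small result to the driver
```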

Apache Spark - group by and count
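The code shown on this slide is not available in the text, so the following is a sketch of a group-by-and-count in PySpark rather than the original example. It assumes the `pyspark` package is installed. The sample categories and the `category` column name are hypothetical.

```python
# Hypothetical sample data; the slide's original data is not available.
CATEGORIES = ["books", "games", "books"]


def to_spark_rows(categories):
    """Convert a list of category names into one-column tuples for Spark."""
    return [(name,) for name in categories]


def count_with_spark(categories):
    """Group by category and count rows using the Spark DataFrame API."""
    from pyspark.sql import SparkSession  # pip install pyspark; imported lazily

    spark = SparkSession.builder.appName("group-by-count").getOrCreate()
    df = spark.createDataFrame(to_spark_rows(categories), ["category"])
    result = df.groupBy("category").count().collect()  # list of Row objects
    return {row["category"]: row["count"] for row in result}
```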

Ray use cases - data processing
Large-scale data ingest and transform
Change data capture
Distributed shuffle

Ray use cases - AI and machine learning
Fast “last-mile” data loading to improve model-training resource usage
Automated machine learning (AutoML) - find best model and tuning parameters
Hyper-parameter tuning - find best tuning parameters for a given model
Reinforcement learning - learn from repeated actions and results
Model-ensemble predictions

AWS SDK for Pandas uses Modin for distributed Pandas!
Debugging and performance tips: https://github.com/aws/aws-sdk-pandas/discussions/1815#common-errors
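Modin's main appeal is that it mirrors the Pandas API, so in many cases only the import changes. The sketch below illustrates that idea, assuming `pandas` and, for the distributed case, `modin` with a Ray backend are installed. The helper names and the `category` column are hypothetical and not part of either library.

```python
import importlib


def backend_name(distributed: bool) -> str:
    """Module to import: Modin's drop-in API or plain Pandas."""
    return "modin.pandas" if distributed else "pandas"


def dataframe_module(distributed: bool):
    """Import the chosen DataFrame library lazily and return the module."""
    return importlib.import_module(backend_name(distributed))


def rows_per_category(pd, records):
    """Identical DataFrame code for either backend."""
    df = pd.DataFrame(records)
    return df.groupby("category").size().to_dict()
```

For example, `rows_per_category(dataframe_module(distributed=True), records)` would run the same group-by through Modin instead of Pandas.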

Lots of Ray tutorials
https://github.com/aws-samples/aws-samples-for-ray
https://docs.ray.io/en/latest/ray-core/examples/overview.html

Scaling Pandas on AWS
Cheers!
Chris Fregly, Principal Solution Architect @ AWS

High-memory instance types on AWS
