The document presents updates from AWS re:Invent 2022, focusing on enhancements to the AWS SDK for Pandas, including serverless capabilities and large language model support. It highlights the scalability of data processing and AI/ML workloads using AWS services like Amazon S3, Glue, SageMaker, and the integration of Ray. Additionally, it discusses tutorials and best practices for utilizing the SDK to manage big data efficiently in cloud environments.
Scaling Pandas on AWS
Let me count the ways!
Chris Fregly, Principal Solution Architect @ AWS
Agenda
Updates from re:Invent 2022 (this week!)
Amazon CodeWhisperer
Overview of AWS services for data, AI and machine learning
AWS SDK for Pandas
Serverless Ray on AWS
Updates from AWS re:Invent 2022 (this week!)
Focused on SageMaker usability, collaboration, and notebook-as-jobs
First-class support for large language and generative models like Stable Diffusion
Introduces Serverless Ray (ray.io) to AWS services including SageMaker and Glue
Quick overview of AWS services for data and AI/ML
AI and machine learning
Data and analytics
AWS SDK for Pandas - Python library featuring Modin!
Captures AWS best practices including concurrent reads/writes and security
Developed and maintained by AWS Professional Services and Solution Architects
Allows Pandas to scale to the cloud!
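A minimal sketch of what this looks like in code, assuming the library is installed under its import name `awswrangler` and that AWS credentials are already configured. The bucket name and prefix are hypothetical placeholders; `wr.s3.to_parquet` and `wr.s3.read_parquet` are the library's S3 Parquet calls.

```python
def dataset_path(bucket: str, prefix: str) -> str:
    """Build an S3 dataset URI such as s3://bucket/prefix/."""
    return f"s3://{bucket}/{prefix.strip('/')}/"


def roundtrip_parquet(df, bucket: str, prefix: str):
    """Write a DataFrame to S3 as a Parquet dataset, then read it back.

    Requires awswrangler and valid AWS credentials, so the import is kept
    inside the function; nothing here talks to AWS until it is called.
    """
    import awswrangler as wr

    path = dataset_path(bucket, prefix)
    # dataset=True writes a (possibly multi-file) dataset under the prefix.
    wr.s3.to_parquet(df=df, path=path, dataset=True, mode="overwrite")
    return wr.s3.read_parquet(path=path, dataset=True)


# Hypothetical usage (bucket name is a placeholder):
#   import pandas as pd
#   df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
#   df_back = roundtrip_parquet(df, "my-example-bucket", "demo/sales")
```

The calling code stays ordinary Pandas; the library handles the S3 reads and writes.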
Why does scale matter?
Big data is, well, big! Requires a lot of RAM to analyze
Data that doesn’t fit into a single server’s RAM must be processed on a cluster
Preferably a dedicated, serverless cluster
Dedicated cluster avoids contention with other users/jobs (long-running)
Serverless reduces cost because we pay for only what we use - less idle time
Fortunately, AWS has innovated on fast-start (<1 sec) dedicated serverless clusters
Based on Firecracker open source project
https://firecracker-microvm.github.io
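The RAM argument above can be made concrete with a back-of-the-envelope calculation. This is only a sketch: the `overhead_factor` of 3 is an assumed rule of thumb for in-memory working copies during analysis, not a figure from the talk, and the function names are illustrative.

```python
import math


def nodes_needed(dataset_gib: float, ram_per_node_gib: float,
                 overhead_factor: float = 3.0) -> int:
    """Estimate how many nodes are needed to hold a dataset in memory.

    overhead_factor is an ASSUMED multiplier for intermediate copies made
    while analyzing the data; tune it for the actual workload.
    """
    if dataset_gib < 0 or ram_per_node_gib <= 0 or overhead_factor <= 0:
        raise ValueError("sizes and overhead_factor must be positive")
    working_set_gib = dataset_gib * overhead_factor
    return max(1, math.ceil(working_set_gib / ram_per_node_gib))


def fits_on_one_node(dataset_gib: float, ram_per_node_gib: float,
                     overhead_factor: float = 3.0) -> bool:
    """True when the estimated working set fits in a single node's RAM."""
    return nodes_needed(dataset_gib, ram_per_node_gib, overhead_factor) == 1


print(nodes_needed(10, 64))   # small dataset: one node is enough
print(nodes_needed(100, 64))  # larger dataset: needs a cluster
```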
Which AWS services are supported by AWS SDK for Pandas?
● Amazon S3
● AWS Glue Catalog
● Amazon Athena
● AWS Lake Formation
● Amazon Redshift
● PostgreSQL
● MySQL
● SQL Server
● Oracle
● Data API Redshift
● Data API RDS
● OpenSearch
● Amazon Neptune
● DynamoDB
● Amazon Timestream
● Amazon EMR
● Amazon CloudWatch Logs
● Amazon Chime
● Amazon QuickSight
● AWS STS
● AWS Secrets Manager
● Global Configurations
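As an illustration of two of the integrations listed above (AWS Glue Catalog and Amazon Athena), here is a minimal sketch. It assumes `awswrangler` is installed and credentials are configured; the database and table names are hypothetical, and `preview_sql` is a helper written for this example rather than part of the library.

```python
def preview_sql(table: str, limit: int = 10) -> str:
    """Build a small preview query for a table (illustrative helper)."""
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    return f'SELECT * FROM "{table}" LIMIT {int(limit)}'


def preview_table(database: str, table: str, limit: int = 10):
    """List the catalog tables in a database, then preview one via Athena.

    Needs awswrangler plus AWS credentials, so the import is deferred
    until the function is actually called.
    """
    import awswrangler as wr

    tables = wr.catalog.tables(database=database)  # Glue Catalog metadata
    rows = wr.athena.read_sql_query(sql=preview_sql(table, limit),
                                    database=database)
    return tables, rows


# Hypothetical usage:
#   tables, rows = preview_table("my_example_db", "orders", limit=5)
```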
Where can I run the AWS SDK for Pandas library?
Local laptop - limited by RAM
AWS Glue notebooks/jobs, interactive serverless clusters
Amazon SageMaker Studio notebooks/jobs, interactive serverless clusters through Glue
Amazon Elastic MapReduce (EMR) Studio notebooks/jobs - serverless clusters (2021)
Why Ray on AWS?
Used by Amazon.com for some data-intensive use cases
Better performance than Apache Spark in some cases
Customers are asking for unified Ray API for both data and AI/ML workloads
Serverless Ray on AWS
Scalable data transformations through Ray Datasets
Scalable AI and machine learning through Ray AI Runtime (AIR)
Serverless clusters through SageMaker + Glue Interactive Sessions integration
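A minimal sketch of a Ray Datasets transformation, assuming a recent Ray 2.x release installed with its data extras. The column names are made up for illustration. Run locally, `ray.init()` starts a single-node cluster; on a managed service the cluster would be provided by the service instead.

```python
def add_total(row: dict) -> dict:
    """Per-row transform: add a derived 'total' column."""
    return {**row, "total": row["price"] * row["quantity"]}


def transform_with_ray(rows: list) -> list:
    """Apply add_total to every row in parallel using Ray Datasets.

    The import is deferred so this file loads without Ray installed.
    """
    import ray

    ray.init(ignore_reinit_error=True)  # local cluster unless one exists
    ds = ray.data.from_items(rows)      # distributed dataset from records
    return ds.map(add_total).take_all()


# Hypothetical usage:
#   rows = [{"price": 2.0, "quantity": 3}, {"price": 1.5, "quantity": 4}]
#   print(transform_with_ray(rows))
```

The transform itself is plain Python; only the way it is mapped over the data is distributed.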
Ray use cases - data processing
Large-scale data ingest and transform
Change data capture
Distributed shuffle
Ray use cases - AI and machine learning
Fast “last-mile” data loading to improve model-training resource usage
Automated machine learning (AutoML) - find best model and tuning parameters
Hyper-parameter tuning - find best tuning parameters for a given model
Reinforcement learning - learn from repeated actions and results
Model-ensemble predictions
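To make the hyper-parameter tuning use case concrete, here is a sketch that evaluates candidate settings in parallel with plain Ray tasks. It deliberately uses Ray core (`ray.remote`, `ray.get`) rather than Ray Tune, and the `score` function is a made-up stand-in for a real train-and-validate step.

```python
def score(learning_rate: float) -> float:
    """Toy validation loss, minimized at 0.1 (invented for illustration)."""
    return (learning_rate - 0.1) ** 2


def pick_best(results: list) -> tuple:
    """Return the (learning_rate, loss) pair with the lowest loss."""
    return min(results, key=lambda pair: pair[1])


def search_with_ray(candidates: list) -> tuple:
    """Score every candidate as a separate Ray task, then pick the best."""
    import ray  # deferred so the module loads without Ray installed

    ray.init(ignore_reinit_error=True)
    remote_score = ray.remote(score)  # wrap the function as a Ray task
    futures = [remote_score.remote(lr) for lr in candidates]
    losses = ray.get(futures)         # block until all tasks finish
    return pick_best(list(zip(candidates, losses)))


# Hypothetical usage:
#   best_lr, best_loss = search_with_ray([0.001, 0.01, 0.1, 1.0])
```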
AWS SDK for Pandas uses Modin for distributed Pandas!
https://github.com/aws/aws-sdk-pandas/discussions/1815#common-errors <= Debugging and Performance
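Modin exposes the Pandas API through the `modin.pandas` module, so code written against Pandas can often run distributed with only an import change. The sketch below picks Modin when it is installed and falls back to Pandas otherwise. The helper names are illustrative, and this selection logic is an assumption made for the example, not how the SDK itself switches engines.

```python
import importlib
import importlib.util


def pick_dataframe_module(prefer_distributed: bool = True) -> str:
    """Return 'modin.pandas' when Modin is installed and wanted, else 'pandas'."""
    if prefer_distributed and importlib.util.find_spec("modin") is not None:
        return "modin.pandas"
    return "pandas"


def load_dataframe_api(prefer_distributed: bool = True):
    """Import and return the chosen DataFrame module."""
    return importlib.import_module(pick_dataframe_module(prefer_distributed))


# Hypothetical usage: the rest of the code is identical either way.
#   pd = load_dataframe_api()
#   df = pd.DataFrame({"x": [1, 2, 3]})
#   print(df["x"].sum())
```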
Lots of Ray tutorials
https://github.com/aws-samples/aws-samples-for-ray
https://docs.ray.io/en/latest/ray-core/examples/overview.html