11. Infrastructure v1.0 - Problem
$ psql mole-redshift
> connection limit 500 exceeded for non-bootstrap users
$ fO_o
Our team used to call this the “rainbow rangers”
15. Parquet Transformation
Btw, we are deploying Lambda using Serverless.
Apache Parquet & Apache Arrow
1. Parquet: Columnar data on disk
2. Arrow: Columnar data in memory
Reference: https://arrow.apache.org/docs/python/parquet.html
16. Parquet Transformation
Write and upload to S3
Read and use Parquet with pandas
Lambda Function Handler:
- Enrich, cleanse, or transform data with Pandas
- Write data back to S3
Reference: https://arrow.apache.org/docs/python/parquet.html
24. Goal: to enrich user experience with article recommendations
Pipelines
ETL: Video Data
ETL: Article Data
ETL: User Engagement
ETL: Data imports
M.L.: Article Topic Modeling
M.L.: User Reading Habits → Collaborative Filtering