Systems that read data from AWS S3 often pay a performance penalty. Compute and storage are not co-located, so the data has to travel over slower, often congested networks. Alluxio can provide a caching layer for the data, but there is still the question of which data to move, and how and when to move it. Should all data be cached by default, or should it be cached only when it is used? In this talk, I will explore the gray area in between, where users and dataset publishers collaborate to decide what data is cached, and how, in a tiered-storage architecture to maximize performance and minimize operating costs.