We recently chose a data warehouse after running a basic proof of concept (POC) on four candidates: AWS Redshift, AWS Athena, Snowflake, and Google BigQuery. In these slides I share the considerations unique to our business that led us to choose Snowflake, along with the pros and cons of the various warehouses.
2. What is a Data Warehouse
Central Repository for all of your data
3. Why do we need a data warehouse
● Data in silos does not help us provide insights
● Single point of data access for the whole organization
4. Existing Solution
● Raw data present in AWS S3 in JSON format
● Data scientists asked the Engineering team for data samples to build machine learning models
● Product/Business asked the Engineering team for insights from the data
● Engineering used Apache Spark to fulfill these requests
So basically we didn’t have a data warehouse. All the data sat in AWS S3 and we fulfilled data requirements ad hoc.
5. Problem in the Existing Solution
● Engineering was in the middle of everything
● Data scientists needed data samples to create machine learning models
● Data scientists needed to run ad-hoc queries on all of the data to make decisions
● Business needed ad-hoc access to the data
● Few people had the know-how to do that. Not everyone was comfortable with Apache Spark
● Engineering had machine configurations sorted out for a few GBs of data. For TBs, we ran queries and wondered when they would complete. Sometimes they did, sometimes they did not.
6. Usual business requirements
● Within the budget
● Should be able to handle the data volume
● Should be able to support N concurrent users
● Should be able to return queries in X time
● Should be able to scale
● etc.
7. Our additional business requirements
Must haves
● Easily accessible
● Minimize engineering effort/intervention needed
Good to haves
● A UI to interact with the warehouse. If none is provided, we would need to evaluate the instructions for setting up a separate tool. SQL is considered OK
● Able to handle a variable schema
● Less ongoing maintenance
8. Warehouses considered in order of preference
● AWS Redshift
● AWS Athena
● Snowflake
● Google BigQuery
9. Engineering Considerations
● Within the budget
● Should be able to handle the data volume
● Should be able to support N concurrent users
● Should be able to return queries in X time
● Should be able to scale
Our answers
● We would have to calculate the budget for each solution separately
● All of the solutions considered were designed to handle much bigger loads than ours
10. Engineering Considerations contd.
● Easily accessible
● Minimize engineering effort/intervention needed
Our answers
● SQL was acceptable for access, and all of them offered SQL access
● Taking engineering out of the picture was the part we would need to spend more time on
11. Engineering Considerations contd.
● Being able to handle a variable schema (biggest concern)
○ Our data does not have a fixed schema. It keeps changing and will keep changing
○ How do we create the initial tables?
○ How do we automate schema updates when our raw data schema changes?
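To make the two schema questions above concrete, here is a minimal sketch in Python, not the approach we actually used. It assumes the raw data is one JSON object per line, looks only at top-level fields, and uses made-up sample records; the function names `infer_schema` and `new_fields` are hypothetical.

```python
import json


def infer_schema(records):
    """Map each top-level field name to the set of Python type names seen."""
    schema = {}
    for record in records:
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def new_fields(known_schema, records):
    """Return fields that appear in records but not in known_schema."""
    return sorted(set(infer_schema(records)) - set(known_schema))


# Made-up sample lines standing in for raw JSON files (assumption).
old_lines = ['{"id": 1, "name": "a"}', '{"id": 2}']
new_lines = ['{"id": 3, "country": "IN"}']

old_batch = [json.loads(line) for line in old_lines]
new_batch = [json.loads(line) for line in new_lines]

known = infer_schema(old_batch)        # could drive the initial table definition
added = new_fields(known, new_batch)   # fields that would need a schema update
print(known)
print(added)
```

The first function answers "what columns would the initial table need", the second flags drift in a new batch. Nested objects and type conflicts are deliberately left out of this sketch.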
12. Engineering Considerations contd.
● Capacity planning for the warehouse itself
○ Machines for storage
○ Machines for query speed
○ If we don't right-size it, we may overpay
● Plan for tuning and ongoing DBA work
○ Capacity re-planning
○ Tuning columns, keys, etc.
13. Engineering Considerations contd.
● Adding new data sources in the future (JSON files with a varying schema, like our data)
○ How much engineering intervention would be required if someone wanted to add a new data source?
● Automating
○ insertions into the warehouse
○ changing data format from row to columnar
○ deleting from the warehouse
14. Engineering Considerations contd.
The more the warehouse can handle for us out of the box, the better, because it reduces engineering intervention
To be considered
● Schema
● Capacity Planning
● Ongoing DBA work
● Automating (making sure all data is inserted and none is duplicated)
● Adding new data sources
I am not covering budget calculations here
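The automation point above (all data inserted, none duplicated) can be sketched as an idempotent load that remembers which files were already loaded. This is an illustration under assumptions, not our pipeline: the file keys are made up, `load_one` stands in for whatever actually inserts a file into the warehouse, and the set of loaded keys would live in durable storage in practice.

```python
def files_to_load(available_keys, loaded_keys):
    """Return the keys not loaded yet, in a stable (sorted) order."""
    return sorted(set(available_keys) - set(loaded_keys))


def load_new_files(available_keys, loaded_keys, load_one):
    """Load each pending file once and record it; return what was loaded."""
    newly_loaded = []
    for key in files_to_load(available_keys, loaded_keys):
        load_one(key)          # hypothetical insert into the warehouse
        loaded_keys.add(key)   # remember it so a re-run skips this file
        newly_loaded.append(key)
    return newly_loaded


# Made-up file keys; a list's append method stands in for the real insert.
inserted = []
loaded = set()
keys = ["events/2020-01-01.json", "events/2020-01-02.json"]

first_run = load_new_files(keys, loaded, inserted.append)
second_run = load_new_files(keys, loaded, inserted.append)
print(first_run)
print(second_run)
```

Running the load twice inserts each file only once. A real setup would also need the insert and the bookkeeping to succeed or fail together, otherwise a crash between the two steps could still produce a duplicate.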
15. Usage models
● AWS Redshift
○ Pay for machine size
● AWS Athena/Google BigQuery
○ Pay for storage
○ Pay for amount of data scanned
● Snowflake
○ Pay for storage
○ Pay for compute used
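The three usage models above translate into three different cost formulas. The sketch below only shows the shape of each calculation; every rate and quantity in it is a made-up placeholder, not a real price from any vendor, and the function names are hypothetical.

```python
def cost_pay_for_machines(node_count, price_per_node_hour, hours):
    """Redshift-style model: pay for the machines whether or not they are busy."""
    return node_count * price_per_node_hour * hours


def cost_pay_for_scan(storage_tb, price_per_tb_stored, tb_scanned, price_per_tb_scanned):
    """Athena/BigQuery-style model: pay for storage plus data scanned by queries."""
    return storage_tb * price_per_tb_stored + tb_scanned * price_per_tb_scanned


def cost_pay_for_compute(storage_tb, price_per_tb_stored, compute_hours, price_per_compute_hour):
    """Snowflake-style model: pay for storage plus compute time actually used."""
    return storage_tb * price_per_tb_stored + compute_hours * price_per_compute_hour


# Placeholder inputs only (assumptions): plug in your own volumes and current prices.
machines = cost_pay_for_machines(node_count=2, price_per_node_hour=1.0, hours=720)
scan = cost_pay_for_scan(storage_tb=10, price_per_tb_stored=20.0,
                         tb_scanned=50, price_per_tb_scanned=5.0)
compute = cost_pay_for_compute(storage_tb=10, price_per_tb_stored=20.0,
                               compute_hours=100, price_per_compute_hour=2.0)
print(machines, scan, compute)
```

The useful part is which input dominates each formula: machine-hours in the first, data scanned in the second, and compute-hours in the third. That is what a per-solution budget calculation would need to estimate.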
16. AWS Redshift
Pros
● Adding new data sources - single file: low effort
Cons
● Capacity Planning is needed
● Ongoing DBA work is needed
● Adding new data sources - multiple files: large effort
No good solution - schema and automating
17. AWS Athena
Pros
● No capacity planning needed
Cons
● Ongoing DBA work is needed
● Adding new data sources - single file: medium effort; multiple files: large effort
No good solution - schema and automating
18. Snowflake
Pros
● Very good solution for variable schema - VARIANT data type
● No capacity planning required
● No to low DBA work needed
● Adding new data sources - single file: medium effort; multiple files: large effort
● Great solution for automating
Cons
● Some work for the schema was still required but not a lot