This session is a deep-dive introduction to Snowflake covering Snowflake architecture, Virtual Warehouses, designing a real use case, and loading data into Snowflake from a Data Lake.
KSnow: Getting started with Snowflake
1. Presented By: Sarfaraz Hussain
Sr. Software Consultant
Knoldus Inc
KSnow: Getting started with Snowflake
(A cloud data warehouse)
2. Lack of etiquette and manners is a huge turn-off.
KnolX Etiquette
Punctuality
Respect KnolX session timings; please do not join a session more than 5 minutes after its start time.
Feedback
Make sure to submit constructive feedback for every session, as it is very helpful for the presenter.
Silent Mode
Stay on mute unless it is necessary to speak.
Avoid Distraction
Stay with the presenter during the session and enjoy.
3. Agenda
01 Prerequisite Knowledge
02 Snowflake and its internal Architecture
03 Snowflake vs. Big Data Tools
04 Virtual Warehouse and Staging Area
05 Deep Dive into working Architecture
06 Use-case and DEMO
6. What is a Data Warehouse?
- A data warehouse (DWH) is a centralized place to store large amounts of historical data produced by a system/organization, so that meaningful insights can be found by processing and analyzing the data.
- Traditional Data Warehouse Architecture:
7. What is Snowflake?
Snowflake is a modern data processing system designed to make the best use of the elasticity of the cloud, so that it can scale with almost no practical limit.
Features:
- Cloud based data warehouse
- SaaS solution
- Pay per Use model (storage + compute)
- Supports standard ANSI SQL
- Supports ODBC and JDBC connectors
- Auto Scalable and Elastic (Virtual Warehouse)
- Unlimited storage of data (Uses AWS S3, Azure Blob Storage, Google Cloud Storage)
8. Snowflake (contd.)
Advantages:
- Easy to process huge volumes of data
- Provides ACID transactions
- No data backups required
- No need to worry about Optimization
- No need to maintain Indexes
- No Out of Memory issues
- Sharing data
Disadvantage:
- COST
9. Snowflake vs. Big Data Tools
Apache Hive
- It is a data warehouse on top of HDFS
- It has performance challenges as it uses MapReduce for processing
Apache Spark (Batch SQL processing)
- Spark SQL has limited support for advanced SQL operations
- Advanced optimizations are the developer’s responsibility
- Resource allocation is the developer’s responsibility
12. Data Storage Layer
- When we create a Snowflake account, we select the underlying cloud provider.
- The cloud provider can be AWS, Azure, or Google Cloud.
- Depending on that choice, the Data Storage Layer (DSL) is hosted on AWS S3, Azure Blob Storage, or Google Cloud Storage.
- The DSL stores the actual data and provides effectively unlimited space.
- Data in the DSL is stored in a compressed columnar format and encrypted with AES 256-bit encryption.
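To make "compressed columnar format" concrete, here is a minimal sketch that pivots row-oriented records into columns and compresses both layouts. It is only an illustration: the sample rows are made up, JSON plus zlib stand in for a real storage format, and nothing here reflects Snowflake's actual internal file format or its encryption step.

```python
import json
import zlib

# Hypothetical sample rows; not taken from the slides.
rows = [
    {"id": i, "country": "IN" if i % 2 == 0 else "US", "amount": 100 + i % 5}
    for i in range(1000)
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Compress the same data in both layouts (JSON + zlib are stand-ins only).
row_bytes = zlib.compress(json.dumps(rows).encode("utf-8"))
col_bytes = zlib.compress(json.dumps(columns).encode("utf-8"))

print("row layout:   ", len(row_bytes), "bytes")
print("column layout:", len(col_bytes), "bytes")
```

Because values of the same column sit next to each other, a columnar layout tends to compress better, and a query that needs only one column can skip the others.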
13. Virtual Warehouse
- A Virtual Warehouse is a cluster of nodes that processes the data.
- On AWS these nodes are EC2 instances; on Azure and Google Cloud they are the equivalent compute instances.
- The Virtual Warehouse performs all computation/processing, which covers both loading and querying data.
- It does not store the data and can be suspended when not in use.
- A suspended virtual warehouse can resume automatically when a query is run.
- It can cache query results.
- The size of a virtual warehouse can be scaled up or down (manual process).
- Elastic or multi-cluster virtual warehouse: can add clusters of the same size depending on the workload (automatic process).
- When should the cluster be scaled up or down?
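The options above map onto Snowflake warehouse parameters. The sketch below only builds the SQL text in Python, so it can be inspected without an account. The helper names and the warehouse name `demo_wh` are hypothetical, and multi-cluster settings may depend on the account's edition.

```python
def create_warehouse_sql(name, size="XSMALL", auto_suspend_seconds=300,
                         min_clusters=1, max_clusters=1,
                         scaling_policy="STANDARD"):
    """Build a CREATE WAREHOUSE statement (hypothetical helper)."""
    if max_clusters < min_clusters:
        raise ValueError("max_clusters must be >= min_clusters")
    return (
        f"CREATE WAREHOUSE IF NOT EXISTS {name} WITH "
        f"WAREHOUSE_SIZE = '{size}' "
        f"AUTO_SUSPEND = {auto_suspend_seconds} "   # seconds of inactivity
        f"AUTO_RESUME = TRUE "                      # resume when a query arrives
        f"MIN_CLUSTER_COUNT = {min_clusters} "
        f"MAX_CLUSTER_COUNT = {max_clusters} "      # > 1 means multi-cluster
        f"SCALING_POLICY = '{scaling_policy}'"
    )


def resize_warehouse_sql(name, size):
    """Build the statement for a manual scale up or down."""
    return f"ALTER WAREHOUSE {name} SET WAREHOUSE_SIZE = '{size}'"


print(create_warehouse_sql("demo_wh", max_clusters=3))
print(resize_warehouse_sql("demo_wh", "LARGE"))
```

Resizing changes how large each cluster is (scale up/down), while the cluster counts control how many same-size clusters may run at once (scale out/in).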
14. Scaling Policy
- How many queries does Snowflake queue before it spins up an additional cluster?
- STANDARD: Immediately when a query is queued, i.e. when the system detects that there is one more query than the currently running clusters can execute.
- ECONOMY: Only if the system estimates there is enough query load to keep the new cluster busy for at least 6 minutes.
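The two policies can be pictured as a small decision function. This is a toy model of the rules as stated on this slide, not Snowflake's real scheduler; the function name and its inputs (`queued_queries`, `estimated_busy_minutes`) are assumptions made for illustration.

```python
def should_start_cluster(policy, queued_queries, estimated_busy_minutes):
    """Toy model: decide whether to start one more cluster."""
    policy = policy.upper()
    if policy == "STANDARD":
        # Start as soon as at least one query is waiting in the queue.
        return queued_queries > 0
    if policy == "ECONOMY":
        # Start only if queries are waiting and the estimated load would
        # keep the new cluster busy for at least 6 minutes.
        return queued_queries > 0 and estimated_busy_minutes >= 6
    raise ValueError(f"unknown scaling policy: {policy}")


print(should_start_cluster("STANDARD", 1, 0.5))
print(should_start_cluster("ECONOMY", 1, 0.5))
```

STANDARD favors low queueing time at a higher cost; ECONOMY favors keeping running clusters fully loaded, so queries may wait longer.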
15. Virtual Warehouse Size
Size        No. of nodes
X-Small     1
Small       2
Medium      4
Large       8
X-Large     16
2X-Large    32
3X-Large    64
4X-Large    128
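The node counts on this slide double with each size step, which can be written as a short lookup. A minimal sketch; the function name `nodes_for` is hypothetical and the numbers are simply the ones listed above.

```python
# Sizes in increasing order, as listed on the slide.
SIZES = ["X-Small", "Small", "Medium", "Large",
         "X-Large", "2X-Large", "3X-Large", "4X-Large"]


def nodes_for(size):
    """Return the node count for a size: 1 for X-Small, doubling per step."""
    return 2 ** SIZES.index(size)


for size in SIZES:
    print(f"{size:<10}{nodes_for(size)}")
```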
21. Deep Dive in Architecture
22. Staging Area
“Stages” or “Staging Areas” are places to put things temporarily before moving them to a
more stable location.
23. Staging Area (contd.)
- External storage from which data is loaded into Snowflake’s Data Storage Layer.
- External storage can be AWS S3, Azure Blob Storage, or Google Cloud Storage.
- It is treated as a Data Lake where data first lands.
- From the staging area we load data into the Snowflake database, after performing transformations if required.
- To load batch data:
Snowflake’s COPY command, Informatica, Talend, Matillion
- To load continuous data:
Snowpipe, Kafka, Kinesis
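A batch load with the COPY command could look roughly like the statements built below. This is a sketch under assumptions: the stage name `sales_stage`, the table `sales`, and the bucket URL are hypothetical, the files are assumed to be CSV with one header row, and the credentials or storage integration that a private bucket would need are left out.

```python
def create_stage_sql(stage, url):
    """Build a CREATE STAGE statement pointing at external storage."""
    return f"CREATE STAGE IF NOT EXISTS {stage} URL = '{url}'"


def copy_into_sql(table, stage, pattern=None, skip_header=1):
    """Build a COPY INTO statement that loads CSV files from a stage."""
    sql = f"COPY INTO {table} FROM @{stage}"
    if pattern:
        # Optional regular expression to select files within the stage.
        sql += f" PATTERN = '{pattern}'"
    sql += f" FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = {skip_header})"
    return sql


# Hypothetical names, for illustration only.
print(create_stage_sql("sales_stage", "s3://example-bucket/sales/"))
print(copy_into_sql("sales", "sales_stage", pattern=".*[.]csv"))
```

The stage is created once and then reused; each COPY run loads the staged files into the target table using the processing power of a running virtual warehouse.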