From Raw Data to an Interactive Data App in an Hour: Powered by Snowpark Python

© 2023 Snowflake Inc. All Rights Reserved
FROM RAW DATA TO
INTERACTIVE DATA APP!
Powered by Snowpark Python

Challenges in Developing Data Pipelines
- Troubleshooting & debugging failed jobs
  - Multi-page stack trace
- Capacity management & resource sizing
- Setting up Infrastructure and Configs
  - Executor memory
  - Driver memory
  - # of executors
- Z-ordering, V-ordering, ABC-ordering
- Partitioning, Bucketing, Salting

Challenges in Developing Data Pipelines
- Troubleshooting a failed Spark job
- Executor memory
- Driver memory
- # of executors

ENTER SNOWPARK

Snowpark for Python
Languages: Python • Java • Scala
Client-side libraries: DataFrame API
Server-side runtimes: UDFs, Stored Procedures
Compute: Warehouses (Standard & Snowpark-Optimized)

Snowpark: Secure Deployment & Processing of Non-SQL Code
[Architecture diagram]
CLIENT-SIDE LIBRARIES
- DataFrame API, e.g. df.filter(df.state == 'WA'), sent to Snowflake as a SQL query
- Python Functions & SProcs, e.g. @udf def detect_fraud(), sent to Snowflake as Python bytecode
- Snowflake Connector for Python: Object Serializer, Query Translator
SERVER-SIDE RUNTIMES
- Processing Engine: SQL Engine, Python Secure Sandbox
- Built-in Anaconda Packages & more
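
The DataFrame-to-SQL path in the diagram can be illustrated with a toy translator. This is only a sketch of the general idea (calls are recorded lazily and rendered as SQL text), not Snowpark's implementation: the LazyFrame, Column, and Condition classes and the customers table are hypothetical names made up for this example.

```python
def _literal(value):
    """Render a Python value as a SQL literal (strings are quoted and escaped)."""
    if isinstance(value, str):
        return "'" + value.replace("'", "''") + "'"
    return str(value)


class Condition:
    """A predicate captured as SQL text instead of being evaluated locally."""

    def __init__(self, sql):
        self.sql = sql


class Column:
    """Hypothetical stand-in for a DataFrame column reference."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        # Build an expression rather than comparing values.
        return Condition(f"{self.name} = {_literal(other)}")


class LazyFrame:
    """Toy DataFrame: records filters and only produces SQL on request."""

    def __init__(self, table, conditions=()):
        self.table = table
        self.conditions = tuple(conditions)

    def __getattr__(self, name):
        # Reached only when normal lookup fails, so df.state becomes a column.
        if name.startswith("_"):
            raise AttributeError(name)
        return Column(name)

    def filter(self, condition):
        # Return a new frame; nothing is executed yet.
        return LazyFrame(self.table, self.conditions + (condition,))

    def to_sql(self):
        sql = f"SELECT * FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(c.sql for c in self.conditions)
        return sql


df = LazyFrame("customers")  # hypothetical table name
sql = df.filter(df.state == "WA").to_sql()
print(sql)
```

No data moves to the client in this model: the filter is pushed down as a WHERE clause, which is the point of the "SQL Query" arrow in the diagram.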

DATA STREAMING
WITH
DYNAMIC TABLES

Streaming in Snowflake
PAIN POINTS: BEFORE
- COMPLEXITY: Managing dependencies, scheduling, and orchestration
- INEFFICIENCY: Rebuilding tables completely, no incremental materialization
- MANAGEABILITY: Brittle pipelines unable to react to changes upstream
BENEFITS: AFTER
- SIMPLIFIED PIPELINES: Native support for streaming and continuous batch data pipelines. Easy, declarative semantics with no orchestration or infrastructure management required.
- COST EFFECTIVE: Streaming ingest as much as 50% cheaper than files. Continuous incremental processing reduces wasted compute.
- NATIVE TO DATA CLOUD: Expanding ecosystem in the Data Cloud with consistent and strong security, governance, and scalability.

Streaming ≠ Instantaneous
[Chart: value to business (y-axis) against time between event creation and action (x-axis: 1 sec, 1+ minutes, 6+ hours). Labeled regions: Summit of Now, Peak of Soon After, Mountain of Wisdom, Valley of Irrelevance.]

Streaming ≠ Instantaneous
[Chart: cost to business (y-axis) against time between event creation and action (x-axis: 1 sec, 1+ minutes, 6+ hours). Labeled regions: Summit of Now, Peak of Soon After, Mountain of Wisdom, Valley of Irrelevance. Annotations: "High cost, low return" and "Low cost, untapped potential".]

Streaming Pipelines at a Glance
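[Diagram: three pipeline stages on a shared foundation of storage, scheduling, processing, and governance]
INGEST
- Sources: Apps & Services, OLTP, IoT, Kafka
- Rows and files, loaded with Snowpipe (Auto-Ingest & Streaming)
TRANSFORM
- Tables, Dynamic Tables*
- Python, Java, Scala, SQL
DELIVER
- Sharing, Replication, Native Apps, Worksheets, Dashboards, Serving, Unload
Feature status legend: In Dev, Private, Public*, GA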

Ingestion Options
COPY
- Efficient bulk loading of files
- Control your own compute resources
- Deterministic latency
SNOWPIPE
- Continuous ingestion of files
- Serverless
- Median latency ~30s
SNOWPIPE STREAMING
- Near real-time ingestion of rowsets
- Client application needed
- < 5s median latency
Feature status legend: In Dev, Private, Public, GA

SNOWPIPE: FILES & STREAMING
[Diagram: apps & services and OLTP systems feed Snowflake over two ingestion paths; business intelligence, machine learning, and sharing sit on the other side]
- BATCH: Files, loaded with COPY & Snowpipe
- STREAMING: Rowsets and Kafka topics, loaded with Snowpipe Streaming* & Kafka Connector
SNOWPIPE
• Designed for batched rowsets as files
• Auto-scaled ingestion (10M files/10TB per hr)
• Deduplication with file tracking
SNOWPIPE STREAMING
• For rowsets with variable arrival frequency: insertRows()
• Focus on lower latency & cost
• Ordered ingestion within a channel
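
One way to picture "ordered ingestion within a channel" is a toy in-memory model. The sketch below is not the Snowpipe Streaming SDK: ToyChannel, insert_rows, and ingest are hypothetical names for this example, and integer offsets stand in for whatever offset token a real client would track. It assumes the client records an offset with every batch and, after a restart, resumes from the last committed offset so that rows are neither skipped nor sent twice.

```python
class ToyChannel:
    """Hypothetical in-memory stand-in for a single streaming channel."""

    def __init__(self):
        self.rows = []                # rows kept in arrival order
        self.committed_offset = None  # offset of the last accepted batch

    def insert_rows(self, rows, offset_token):
        # Append the batch in order and remember how far the source was read.
        self.rows.extend(rows)
        self.committed_offset = offset_token


def ingest(channel, source):
    """Send only the rows from `source` that the channel has not seen yet."""
    if channel.committed_offset is None:
        start = 0
    else:
        start = channel.committed_offset + 1
    for offset in range(start, len(source)):
        channel.insert_rows([source[offset]], offset_token=offset)


events = [{"id": 1}, {"id": 2}, {"id": 3}]  # made-up sample rows
channel = ToyChannel()

ingest(channel, events[:2])  # first run sees two events, then "crashes"
ingest(channel, events)      # restart: resumes after the committed offset

print([row["id"] for row in channel.rows])
```

Because order is only tracked per channel, a design along these lines would give each independent source partition its own channel.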

Streaming Use Cases
KAFKA / KINESIS SOURCES
- Use existing event hubs to source data
- Flexible latency-cost profiles
- Run transformations with all reference data instead of just single row transforms (ELT & ETL)
SECURITY & LOG ANALYTICS
- Powered by Snowflake apps
- Add full power of Snowflake analytics from day 1
- One place to query latest window of data & full history + reference data
- Proprietary (ISV-built) pipelines for continuous analysis
IOT / DEVICE LOGS
- Aggregated logs from devices
- No need to add event hubs if not needed
- Simple post-ingestion cleanup
CONNECTORS
- Ingest CDC streams with lower latency
- Ensure exactly once semantics
- Sourced from OLTP DBs, SaaS apps
- Serverless so no clusters / stages to manage

Dynamic Tables Overview
A new table type that automatically and continuously materializes the results of a query.
  CREATE DYNAMIC TABLE <name>
  TARGET_LAG = <duration>
  WAREHOUSE = <warehouse_name>
  AS <select>
  SELECT * FROM <name>
Callouts:
- Store results
- Automatic refreshes
- Any query!
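
The template above can be filled in from Python. The helper below is a minimal sketch that only renders the statement as text; the function name, the table and warehouse names, and the query are all hypothetical, and the lag is written as a quoted duration such as '1 minute'. It does no identifier validation, so it is suitable only for trusted inputs.

```python
def dynamic_table_ddl(name: str, target_lag: str, warehouse: str, select: str) -> str:
    """Render the CREATE DYNAMIC TABLE template as a single SQL statement."""
    query = select.strip().rstrip(";")  # drop any trailing semicolon
    return (
        f"CREATE DYNAMIC TABLE {name}\n"
        f"  TARGET_LAG = '{target_lag}'\n"
        f"  WAREHOUSE = {warehouse}\n"
        f"  AS {query}"
    )


ddl = dynamic_table_ddl(
    name="ad_spend_by_channel",  # hypothetical table name
    target_lag="1 minute",
    warehouse="demo_wh",         # hypothetical warehouse name
    select=(
        "SELECT channel, SUM(spend) AS total_spend "
        "FROM ad_spend_raw GROUP BY channel;"
    ),
)
print(ddl)

# Assumption: with an open Snowpark session, the statement could be
# submitted with session.sql(ddl).collect().
```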

Dynamic Tables Overview
CONSISTENTLY FAST TO QUERY
- Immediate results
- Freshness within LAG
- Snapshot isolation
  SELECT * FROM <name>

Key Features
- DECLARATIVE DATA PIPELINES: Continuous data pipelines as easy as SELECT. Complex pipelines with hundreds of branches. Dynamic Tables manage the scheduling and orchestration.
- SQL SUPPORT: Use any core SQL syntax to define transformations, including joins, unions, aggregations, window functions, group bys, filters, etc.
- USER-DEFINED FRESHNESS: Controlled by a target lag for each table, for the sake of reduced cost and improved performance. Data freshness as low as 1 minute.
- AUTOMATIC INCREMENTAL REFRESHES: Refresh only what's changed, even for complex queries, automatically (yes, including UPDATEs and DELETEs!).
- SNAPSHOT ISOLATION: All Dynamic Tables in a DAG are refreshed consistently from aligned snapshots.
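
The "refresh only what's changed" idea can be sketched with a toy grouped aggregate. This is an illustration of incremental maintenance in general, not Snowflake's refresh algorithm; the function names, the (old_row, new_row) change format, and the sample rows are assumptions made for this example.

```python
def full_recompute(rows):
    """Rebuild {channel: [row_count, total_spend]} from every row."""
    state = {}
    for row in rows:
        entry = state.setdefault(row["channel"], [0, 0])
        entry[0] += 1
        entry[1] += row["spend"]
    return state


def apply_changes(state, changes):
    """Update the aggregate in place from a list of (old_row, new_row) pairs.

    insert = (None, new), delete = (old, None), update = (old, new).
    """
    for old, new in changes:
        if old is not None:
            entry = state[old["channel"]]
            entry[0] -= 1
            entry[1] -= old["spend"]
            if entry[0] == 0:  # last row of the group is gone
                del state[old["channel"]]
        if new is not None:
            entry = state.setdefault(new["channel"], [0, 0])
            entry[0] += 1
            entry[1] += new["spend"]
    return state


r1 = {"channel": "search", "spend": 100}
r2 = {"channel": "video", "spend": 50}
state = full_recompute([r1, r2])  # initial materialization

r2_updated = {"channel": "video", "spend": 70}
r3 = {"channel": "search", "spend": 25}
changes = [(None, r3), (r2, r2_updated), (r1, None)]  # INSERT, UPDATE, DELETE

apply_changes(state, changes)  # touches 3 changed rows, not the whole table
print(state)
```

Keeping a row count next to each sum is what lets a DELETE remove a group cleanly instead of leaving a zero total behind.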

FULL STACK
DATA ENGINEERING
WITH
SNOWPARK

Full Stack DE with Python

Let’s build a Data App!
Ad Spend Optimizer for Ski Gear Co.

THANK YOU!
