© 2021 Snowflake Inc. All Rights Reserved
ACTIONABLE INSIGHTS WITH AI - FROM
EXPERIMENT TO VALUE CREATION
Jan 26, 2021
Harald Erb | harald.erb@snowflake.com
Sr. Solutions Engineer, Central Europe
ABOUT ME
Sr. Solutions Engineer
Central Europe
harald.erb@snowflake.com
linkedin.com/in/haralderb
Enthusiastic about Business Analytics &
Data Management for 20+ years
> Consulting: Delivered large-scale Data
Warehouse and BI projects as Developer,
Information Analyst, Solution Architect,
Project Lead at Oracle D/A/CH
> Presales: 2nd SE on the ground at Snowflake in
Centr. Europe with focus on Data Management,
Business Analytics & Data Science
> Worked with clients on Big Data & IoT solutions
as Architect and Solutions Engineer at Oracle
EMEA, Pentaho and Hitachi Vantara
DIY?
Kubernetes Cluster with 5 Raspberry Pis
???
Fascinating technology, but unfortunately
there is not enough time for DIY...
AGENDA – PART 1
Source: github.com/szilard/ml-prod (Dr. Szilard Pafka)
ML LIFECYCLE
Tools & Ecosystem
• Notebooks (SQL, Python)
• Snowflake: Snowpark (Scala)
Data Development
• Required Platform capabilities
• SQL
• Snowflake: TimeTravel,
Zero-copy clone
Onboarding of new
Datasets
• Data Lake integration
• API integration
• Snowflake Data
Marketplace
Experiment
(Lab environment)
Value added
(ML in Production,
at scale)
SNOWFLAKE DATA ARCHITECTURE FOR
DATA SCIENCE
“DATA SCIENCE ONLY” ENVIRONMENT
DATA SCIENCE + REPORTING DATABASE
SNOWFLAKE DATA ARCHITECTURE
“Data Lake inside”
OPTIONS FOR ORGANIZING DATA ASSETS IN SNOWFLAKE
Data Sources
• Structured Data
• Semi-Structured Data
• Web APIs
• IoT Data
Data Consumers
• Data Visualization / Reporting
• Data Science
• Ad hoc Queries
Data Zones in Snowflake
• Work Area (Exploratory, AI / ML): persistent, user/team space, dedicated compute resources
• Landing Zone: transient, ELT processes, truncate/reload
• Raw: raw data, schemaless (JSON…), no transformations, matches source data
• Conformed: Raw + de-duplicated, data type standardization (dates)
• Reference: master data, manual mappings, business hierarchies
• Modeled: integrated, cleansed, modeled data (3NF, Data Vault, Dimensional Model)
“Data Lake” “Data Warehouse”
Snowflake’s architecture is based on elastic cloud storage, which makes it possible to organize very large amounts of raw data at an affordable
price. This capability enables Data Teams to perform unbounded data discovery and data understanding while Analysts can
access business-friendly data models in a self-service mode.
HANDLING OF SCHEMALESS DATA IN SNOWFLAKE
With Snowflake’s VARIANT data type, semi-structured data can be loaded easily into a relational DW
and is then available for immediate analysis
Semi-structured data
(JSON, Avro, XML, Parquet, ORC)
+
Structured data
(e.g., CSV, TSV, …)
Storage optimization
Transparent discovery and storage optimization
of repeated elements
Query optimization
Full database optimization for queries
on semi-structured data
> SELECT … FROM …
select v:lastName::string
as last_name
, ...
from json_doc_table;
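The path expression `v:lastName::string` on this slide reads one attribute out of a VARIANT column and casts it to a string. As a rough client-side analogy only (this is not how Snowflake evaluates the query), the same idea can be sketched in plain Python. The sample documents and the helpers `get_path` and `as_string` are made up for illustration.

```python
import json

# Hypothetical rows of json_doc_table: each row holds one JSON document in column v.
json_doc_table = [
    '{"firstName": "Ada", "lastName": "Lovelace", "address": {"city": "London"}}',
    '{"firstName": "Alan", "address": {"city": "Wilmslow"}}',
]


def get_path(doc, path):
    """Follow a colon/dot separated path; return None if any key is missing."""
    current = doc
    for key in path.replace(":", ".").split("."):
        if not isinstance(current, dict) or key not in current:
            return None  # a missing attribute behaves like NULL, not like an error
        current = current[key]
    return current


def as_string(value):
    """Rough stand-in for the ::string cast."""
    return None if value is None else str(value)


rows = [json.loads(v) for v in json_doc_table]
last_names = [as_string(get_path(v, "lastName")) for v in rows]
cities = [as_string(get_path(v, "address.city")) for v in rows]
print(last_names)
print(cities)
```

The second document has no `lastName`, so its entry comes back empty instead of failing, which is the behavior that makes schemaless data convenient to query.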
SNOWFLAKE DATA ARCHITECTURE
Storage Integration with external Data Lake
INGESTION & PROCESSING
OF NEW DATASETS
OPTIONS FOR DATA INGESTION
FILES
AUTO-INGEST SNOWPIPE
SNOWPIPE REST API
COPY
Driverless
Notification-driven
Serverless
Async, Continuous
File Dedup
Error Handling
EXTERNAL TABLE AUTO REFRESH
Bulk Loading
● COPY Command
● User-managed compute resource
Continuous Loading
● Snowpipe
● Snowflake-managed compute resource
SOLUTION SCENARIO
SCENARIO: INGESTING FUEL PRICE DATA FOR ANALYSIS
Source: tankerkoenig.de
SOLUTION ARCHITECTURE
In Scope today
EXAMPLE: DATA INGESTION WITH SNOWPIPE
Key Steps
> Integrate with AWS S3 and connect Snowflake via External Stage
> Create a Pipe for Automatic Data Ingestion
> Run Snowpipe with new data
AWS S3: STORAGE INTEGRATION VIA EXTERNAL STAGE
SF Admin Task, typically
not done by developers!
AWS S3: DATA ACCESS DIRECTLY FROM SNOWFLAKE
List the contents of an S3 bucket directly
from Snowflake and navigate the subfolder
structure.
Identify, inspect and select files to be
loaded using “*” wildcards, regular expressions, etc.
Compute statistics
on files to be loaded
into Snowflake
NOTIFICATION-DRIVEN DATA INGESTION WITH SNOWPIPE
Bulk load
command
Target table to be updated
Source location,
external stage
(e.g. S3 Bucket)
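The SQL screenshots for these slides did not survive, so the sketch below only illustrates the three pieces the annotations name: an external stage as source location, a file listing, and a pipe wrapping a bulk load command into a target table. It assembles the statements as strings in Python. All object names (`fuel_stage`, `s3_int`, `fuel_pipe`, `fuel_prices_raw`) and the bucket URL are hypothetical, and the storage integration is assumed to have been created by an admin beforehand.

```python
def snowpipe_statements(stage, bucket_url, integration, pipe, target_table):
    """Return SQL for an external stage, a file listing and an auto-ingest pipe."""
    create_stage = (
        f"CREATE STAGE {stage}\n"
        f"  URL = '{bucket_url}'\n"
        f"  STORAGE_INTEGRATION = {integration}\n"
        f"  FILE_FORMAT = (TYPE = JSON)"
    )
    # Inspect which staged files match a pattern before loading them.
    list_files = f"LIST @{stage} PATTERN = '.*[.]json'"
    # The pipe wraps a COPY (bulk load) command; AUTO_INGEST makes it
    # notification-driven instead of being called explicitly.
    create_pipe = (
        f"CREATE PIPE {pipe} AUTO_INGEST = TRUE AS\n"
        f"  COPY INTO {target_table}\n"
        f"  FROM @{stage}"
    )
    return [create_stage, list_files, create_pipe]


statements = snowpipe_statements(
    stage="fuel_stage",                    # hypothetical name
    bucket_url="s3://example-bucket/fuel/",  # hypothetical bucket
    integration="s3_int",                  # hypothetical, created by an admin
    pipe="fuel_pipe",                      # hypothetical name
    target_table="fuel_prices_raw",        # hypothetical name
)
for statement in statements:
    print(statement + ";\n")
```

In a notebook the statements would typically be sent one by one through a cursor of the Snowflake Python Connector; the bucket additionally needs an event notification pointing at the pipe before new files are picked up automatically.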
EXAMPLE: FAST INTEGRATION OF API PAYLOADS
Key Steps
> Integrate AWS Lambda Function
> Automate API Calls + store Payloads (JSON)
> Implement Change Data Capture
> Automate JSON flattening + Data Loading
API INTEGRATION VIA EXTERNAL FUNCTION
SF Admin Task, typically
not done by developers
External APIs can now
be called via SQL!
EXTERNAL API CALL VIA SQL (EXAMPLE 1)
Table with column
SOURCE_DATA (VARIANT
data type) containing a
collection of raw datasets from
multiple open data sources
Insert Statement
calling external API
EXTERNAL API CALL VIA SQL (EXAMPLE 2)
Fuel price data of multiple
gas stations
Insert Statement
calling external API
AUTOMATED DELTA LOAD WITH STREAMS AND TASKS
Task will only start if table stream
has new data records to process
→ saves compute resources!
Only CDC data
records of interest will
be processed and then
cleared from stream
when committed
Lateral view and flatten table function
used to split price data by Gas Station
and store as separate records in the
target table REMOTE_FUEL_PRICES
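To make the flattening step concrete, here is a small Python emulation of what the task does with each pending payload: skip the run when the stream is empty, otherwise split the price object into one record per gas station. The payload shape (a `prices` object keyed by station id) and the station ids are assumptions for illustration, not taken from the slides, and the function names are hypothetical.

```python
import json

# Assumed payload shape: one object under "prices", keyed by gas station id.
payload = json.loads("""
{
  "ok": true,
  "prices": {
    "station-a": {"status": "open", "e5": 1.309, "e10": 1.249, "diesel": 1.139},
    "station-b": {"status": "closed"}
  }
}
""")


def flatten_prices(doc, loaded_at):
    """Emit one record per gas station, similar to FLATTEN over the prices object."""
    records = []
    for station_id, price_info in sorted(doc.get("prices", {}).items()):
        records.append(
            {
                "station_id": station_id,  # FLATTEN exposes this as KEY
                "loaded_at": loaded_at,
                "price_data": price_info,  # FLATTEN exposes this as VALUE (still JSON)
            }
        )
    return records


def process_stream(pending_payloads, loaded_at):
    """Mimic the task: do nothing if the stream is empty, else drain it."""
    if not pending_payloads:
        return []
    records = []
    while pending_payloads:
        records.extend(flatten_prices(pending_payloads.pop(0), loaded_at))
    return records


stream = [payload]
remote_fuel_prices = process_stream(stream, "2021-01-26T08:00:00")
for record in remote_fuel_prices:
    print(record["station_id"], record["price_data"].get("diesel"))
```

In Snowflake itself the emptiness check and the draining are handled by the task condition and the stream offset; the list manipulation here only stands in for that behavior.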
API DATA IS NOW PREPARED FOR FURTHER USE…
New fuel prices prepared
and stored in target table
REMOTE_FUEL_PRICES
(still in JSON format)
…LIKE IN A DASHBOARD COMBINING HISTORICAL + RT DATA
Final BI Query:
Reading, formatting and joining
JSON price data directly with
master data – with fast query
performance
EXAMPLE: SNOWFLAKE DATA MARKETPLACE
Key Steps
> Identify & select dataset of interest in Snowflake Data Marketplace
> Request new (paid) dataset and agree on terms and conditions
> Once request is approved, access to the Shared Database will be granted by Data Provider
> Query and blend own and shared data for new insights
PUBLIC AND PERSONALIZED DATASETS…
Step 1: Select
dataset of interest
…AVAILABLE BY (PAID) REQUEST…
Step 2: Request
shared dataset
for access in own
Snowflake Account
…READY TO QUERY – DATA SCIENTISTS LOVE IT ALREADY!
Shared Database is
visible (read-only)
in own Snowflake
Account – no
additional ETL
required
Step 3: Query location
features to improve
predictive models
using Snowflake’s Geo
SQL Functions
MORE DATA ENGINEERING
WITH SNOWFLAKE
PLATFORM REQUIREMENT #1: DEDICATED COMPUTE
• Continuous Loading (4TB/day), S3, <5min SLA: Compute Cluster “Medium”
• Batch Data Loads & Transformations: Compute Cluster “Large”
• Interactive Dashboard (50% < 1s, 85% < 2s, 95% < 5s): Compute Cluster Auto Scale, “X-Large” x 5
• Data Science / Data Projects: Dedicated Compute Cluster incl. allowance to resize, e.g. up to “2X-Large” etc.
Prod DB: Structured & Semi-structured Data at Petabyte-Scale (all encrypted, compressed)
Snowflake Shared Data, Multi-Cluster Architecture: All data available in a central repository,
all major workloads isolated, elastic compute provides performance on demand
PLATFORM REQUIREMENT #2: ELASTIC SCALING IN NO TIME
Vertical Scaling: Resize Compute Cluster instantly
• Pure Cloud User Experience
• Scale up/down in no time, no need to start/stop, or reboot
• Benefit: Faster data loads, more complex queries
(aggregations, new features)
Horizontal Scaling: Automatic Scale-out
• Snowflake detects/handles massive parallel workloads
and automatically scales back after the load drops
• Benefit: constantly good performance, even in peak times
PLATFORM REQUIREMENT #3: PAY FOR WHAT YOU USE
[Charts: Workload over the day (Morning, Noon, Night) for ETL and Processing, Reporting, Ad-hoc Analytics, and Data Science]
MEETING USERS WHERE THEY ARE
DEVELOPER EXPERIENCE WITH SNOWFLAKE
• Coding: Data Pipelines, Debugging, e.g. with VS Code, Scala and Snowpark (in Development)
• RDBMS Development: Database models, use data dictionary, RBAC, e.g. with Snowsight, DBeaver, SqlDBM
• Data Science Notebook: Live code embedded in markup text + visualizations, e.g. with Jupyter and SF Python Connector
DON’T FORGET SQL FOR FEATURE ENGINEERING
Create Compute-intensive Features in Snowflake
→ Eliminates need for frameworks such as Featuretools for Python
Examples
• Calculate number of sales per product for last week/month/year
• Rank products within different product categories by price, revenue, etc.
• Calculate share of revenue or margin for a given product
• Find the lowest ever sales price for all products in each product category
Real code example, but simplified for increased readability
Adapted from Elkjøp’s presentation @ Snowflake Stockholm Meetup, 2021
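The original code screenshot is not part of this transcript. As a stand-in, the sketch below computes a few of the listed feature types (sales count per product, rank within category, revenue share, lowest price per category) with SQL window functions. It runs against an in-memory SQLite database (version 3.25 or later for window functions) purely so that it is executable anywhere; the table, columns and values are invented, and the time-window variants (last week/month/year) are left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (product TEXT, category TEXT, price REAL, revenue REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("tv-a", "tv", 500.0, 5000.0),
        ("tv-b", "tv", 300.0, 3000.0),
        ("tv-b", "tv", 250.0, 1000.0),
        ("phone-a", "phone", 200.0, 4000.0),
        ("phone-b", "phone", 100.0, 1000.0),
    ],
)

features = conn.execute(
    """
    WITH per_product AS (
        -- aggregate raw sales rows to one row per product
        SELECT product,
               category,
               COUNT(*)     AS n_sales,
               MIN(price)   AS lowest_price,
               SUM(revenue) AS revenue
        FROM sales
        GROUP BY product, category
    )
    SELECT product,
           category,
           n_sales,
           RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS revenue_rank,
           revenue / SUM(revenue) OVER (PARTITION BY category)       AS revenue_share,
           MIN(lowest_price) OVER (PARTITION BY category)            AS category_lowest_price
    FROM per_product
    ORDER BY category, revenue_rank
    """
).fetchall()

for row in features:
    print(row)
```

The window functions used here are standard SQL, so the same query shape should carry over to Snowflake, where the heavy lifting then happens on the warehouse instead of in the notebook.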
SNOWFLAKE FEATURES FOR REPRODUCIBILITY
TIME TRAVEL
All tables can be versioned for 0 - 90 days (maximum retention depends on the Snowflake edition)
• Extremely useful for tracking changes and
debugging master data issues
→ Eliminates need to maintain separate history
tables in many cases
• Intuitive SQL syntax for querying tables at a
specific point in time
ZERO-COPY CLONING
Take “snapshots” of any table, schema or database
at no extra storage cost until the cloned data diverges
• Useful for creating static datasets when developing
ML models → eliminates need to create datasets in
Azure Blob Storage / AWS S3
• Associate Git commit (code version) with cloned
table (data version) → makes it easier to
reproduce an experiment in the future
Real code example, but simplified for increased readability
Adapted from Elkjøp’s presentation @ Snowflake Stockholm Meetup, 2021
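One way to tie a data version to a code version, sketched under assumptions: a small Python helper builds a zero-copy clone statement whose table name carries the short Git commit hash, plus a Time Travel query for the same point in time. The naming scheme, the table name `training_data` and the commit hash are hypothetical; only the `CLONE` and `AT (TIMESTAMP => …)` clauses are Snowflake syntax.

```python
import re


def clone_for_experiment(source_table, git_commit, as_of_timestamp):
    """Build a zero-copy clone statement whose name carries the code version."""
    if not re.fullmatch(r"[0-9a-f]{7,40}", git_commit):
        raise ValueError("expected a hexadecimal Git commit hash")
    # Hypothetical naming scheme: <table>_exp_<short commit hash>
    clone_name = f"{source_table}_exp_{git_commit[:7]}"
    statement = (
        f"CREATE TABLE {clone_name} CLONE {source_table}\n"
        f"  AT (TIMESTAMP => '{as_of_timestamp}'::timestamp_ltz)"
    )
    return clone_name, statement


def query_as_of(table, as_of_timestamp):
    """Build a Time Travel query that reads the table as of a point in time."""
    return (
        f"SELECT * FROM {table}\n"
        f"  AT (TIMESTAMP => '{as_of_timestamp}'::timestamp_ltz)"
    )


name, ddl = clone_for_experiment("training_data", "3f2a9c1d5e", "2021-01-26 08:00:00")
print(ddl)
print(query_as_of("training_data", "2021-01-26 08:00:00"))
```

Storing the returned clone name next to the experiment results is enough to find the exact training data again later, as long as the clone is not dropped.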
REPORT OF A CUSTOMER’S DATA SCIENTIST TEAM
DATA DEVELOPMENT: PYTHON & SQL COMBINED
→ Notebook
DATA DEVELOPMENT: SCALA VIA SNOWPARK
SNOWPARK – A new developer experience that allows you to write Snowflake code in
your preferred way, and execute it directly in Snowflake (incl. all platform benefits)
IN PRIVATE-PREVIEW
DATA DEVELOPMENT: SCALA VIA SNOWPARK
A comparative example
SQL
• Monolithic, hard to debug
• Repetitive, inflexible operations
Snowpark
• Write, use functions.
• Pipeline, easy to debug
• Flexible, reusable operations
IN PRIVATE-PREVIEW
DATA DEVELOPMENT: SCALA VIA SNOWPARK
Deep Dive and Demo of Preview Version
Webinar Recording: Building Open Source Data Science Models on Snowflake
IN PRIVATE-PREVIEW
AGENDA – PART 2
Source: github.com/szilard/ml-prod (Dr. Szilard Pafka)
ML LIFECYCLE
Data Science Platform (DataRobot)
• Auto ML
• ML Ops
• Model Deployment
Experiment
(Lab environment)
Value added
(ML in Production,
at scale)
DATA SCIENCE (AUTO ML) PLATFORMS TO
COMPLEMENT SNOWFLAKE CAPABILITIES
DATA SCIENCE PLATFORM – ARCHITECTURE (DATAROBOT)
DATA SCIENCE PLATFORM – COMPONENTS (DATAROBOT)
AUTO ML – LEADERBOARD OF A PROJECT
→ DataRobot
ML OPERATIONS
Any ML Ops solution capable and agile enough to handle complex situations has to be proficient in the following four
critical areas to safely deliver machine learning applications in production at scale
ML OPS – MONITORING (SERVICE HEALTH)
ML OPS – MONITORING (MODEL ACCURACY)
INTEGRATION OF PRODUCTION ML MODELS
CODING EXAMPLE: DATAROBOT & SNOWFLAKE
Notebook example: Snowflake + DataRobot Prediction API
Call of DataRobot
prediction API
Sample code: DataRobot GitHub Repo
Open Snowflake cursor
and store
prediction results in a
table for further use
MODEL DEPLOYMENT – SOURCE CODE EXPORT
SNOWFLAKE INTEGRATION EXAMPLE
Detailed Article: DataRobot Blog
Calling a production Machine Learning Model via Snowflake External Functions
SESSION TAKEAWAY
SNOWFLAKE DATA CLOUD
DATA SOURCES
• OLTP DATABASES
• ENTERPRISE APPLICATIONS
• THIRD-PARTY
• WEB/LOG DATA
• IoT
DATA CONSUMERS
• DATA MONETIZATION
• OPERATIONAL REPORTING
• AD HOC ANALYSIS
• REAL-TIME ANALYTICS
A COMPLETE & EASY-TO-USE DATA PLATFORM…
Snowflake supports Data Lake, Data Warehouse, and Data Engineering workloads
Data Sources
• Structured Data
• Semi-Structured Data
• Web APIs
• IoT Data
Presentation / Consumers
• Visualization / Reporting
• Data Science
• Ad hoc Queries
Layers: Stage, Data Lake, Warehouse, Aggregation, Semantic / Federated
Platform capabilities
• JSON, AVRO (VARIANT)
• Hive Metastore Integration
• External Tables
• Parquet
• Load/Unload
• ANSI SQL
• Elastic Multi-Cluster Compute
• Data Vault, 3NF Modeling
• Dimensional Modeling
• ACID Transactional Consistency
• Data Masking / Row-level Security *)
• Secure + Materialized Views
• Zero Copy Cloning
• Time Travel
• SSO, LDAP, OAUTH, SCIM
• ODBC/JDBC
• Python/R/Spark Connector
• Web UI / Snowsight
• External Functions
• Data Sharing / Marketplace
• Streams (CDC) & Tasks (Scheduler)
• Kafka-Connector / Snowpipe
• Stored Procs / UDFs
• Geospatial
• Information Schema
• SnowPark *) (Scala)
• End-to-End Security (RBAC, Encryption at Rest/in Motion)
*) Roadmap item / in Preview
SNOWFLAKE PLATFORM DATA ENGINEERING INTEGRATIONS
…FOR ACCELERATING MACHINE LEARNING & AI
● Integration with AutoML platforms and Notebook-based ML
● Read and writeback
● Spark Connector
● Python Connector
● Apache Arrow support
● Snowpipe
● Kafka
● Streams/Tasks
● Reproducibility – Versioning - Features, Models
● “Share” data & results
● Feature Repository – Re-Use
● Time Travel
● Zero-copy cloning
• Data Platform as a Service
• Instant Scalability & Elasticity
• Dedicated Virtual Warehouses
• Store and use all data regardless of structure
• Cross cloud replication
• Pay per use – per second pricing
• Data Marketplace & Exchange
Q & A
THANK YOU

Actionable Insights with AI - Snowflake for Data Science
