IBM Cloud Day January 2021 Data Lake Deep Dive

IBM Cloud Day 2021
Deep Dive into Cloud Native Data Lakes with IBM Cloud
Torsten Steinbach, IBM Cloud Data Lake Architect
James Bennett, IBM Cloud Data Lake Offering Lead
21st Jan 2021

Content
• Architecture
• Data Skipping
• COVID-19 Data Lake
• Interactive Demo

IBM Cloud Data Lake
Architecture

IBM Cloud Data Lake – Big Picture
Cloud Data Lake (diagram labels: Telemetry Data, Streaming, ETL, Explore, Prep, Enrich, Optimize, Analyze)
✓ Seamless Elasticity
✓ Seamless Scalability
✓ Highly Cost Effective
✓ Long Term Retention
✓ Any data formats
Optional: DWH, Databases (diagram labels: ETL, Analytics)
✓ Response Time SLAs
✓ Warm High-quality Data only

IBM Serverless Stack for Analytics
• Serverless Storage: Object Storage. Only pay for the volume of data that you really store
• Serverless Runtimes: Cloud Functions. Only pay for the CPU that you really consume
• Serverless Analytics: SQL Query. Only pay for the amount of data that you really scan
Blog Article
• Properties of Serverless:
– No management of resources, hosts and processes
– Auto-scaling and auto-provisioning based on actual load
– Precise billing based on really consumed system resources (memory, storage, CPU, network, I/O)
– High-Availability is always implicit

IBM SQL Query – The Central Cloud Data Lake Service
Serverless SQL Query Service on Cloud Data (Object Storage + RDBMS)
Capabilities: Data Ingestion + Data Transformation + Data Management + Analytics
Users: Developers, Data Engineers, Data Analysts, Data Scientists
✓ Supports ad-hoc and unknown data structures
✓ ETL & ELT Support
✓ 100% Pay-as-you-go ($5/TB)
✓ 100% API enabled
✓ Automatic Big Data Scale-Out with Spark
✓ 100% Self service, No Setup
✓ Built-In Database Catalog & Data Skipping
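The pay-as-you-go price can be turned into a quick estimate. This is a minimal sketch, assuming the charge is proportional to the bytes a query scans and that 1 TB means 10^12 bytes (the slide only states the price per TB). The function name `estimate_query_cost` is hypothetical and not part of any IBM SDK.

```python
TB = 10**12  # assumption: decimal terabyte; the slide does not define the unit


def estimate_query_cost(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimated charge in USD for a query that scans `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / TB * usd_per_tb


if __name__ == "__main__":
    full_scan = estimate_query_cost(2 * TB)
    # Same query when 90% of the data does not have to be scanned.
    reduced_scan = estimate_query_cost(2 * TB // 10)
    print(f"full scan: ${full_scan:.2f}, reduced scan: ${reduced_scan:.2f}")
```

Because the charge follows scanned volume, anything that reduces scanning (partitioning, columnar formats, data skipping) lowers cost as well as run time.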

IBM SQL Query Architecture
Flow between Application, SQL Query and Cloud Data Services:
1. Submit SQL
2. Read data
3. Write data
4. Read results
Cloud Data Services: Cloud Object Storage, Db2 on Cloud, Event Streams
SQL Query building blocks: Geospatial SQL, Timeseries SQL, Data Skipping, Hive Metastore
Video
• Using IBM Analytics Engine service (Spark clusters aaS)
• Large farm of Spark clusters auto-provisioned & auto-managed in background
• Managing a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway)
• SQL grammar sandbox
• Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation
• Intrinsically HA (dispatching across Spark environments in each availability zone)

IBM SQL Query – Access Patterns
Diagram labels around SQL Query:
• Patterns: Create, Explore, Integrate, Deploy
• Access paths: SQL Console, Watson Studio Notebooks, Cloud Functions, Python SDK, REST API, JDBC, Object Store Console, Event Streams Console

IBM Cloud Data Lake – Meta Data
Cloud Data: Object Storage, RDBMS
Engines: Spark, Serverless SQL
Meta data categories and technologies:
• Schema, Partitioning, Statistics: Hive Metastore, Kafka Schema Registry
• Data Skipping Indexes: Xskipper
• ACID: Iceberg, Delta Lake
• Governance Policies & Lineage: Watson Knowledge Catalog

IBM Cloud Data Lake – 2021 Architecture
Components: Event Streams, SQL Query, Object Storage (COS)
Meta Data: Integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg)
Capabilities:
• Real-Time Queries
• Batch Queries
• Stream Xform & Joins
• Stream data landing
• Schema management & enforcement
• ETL & Data Preparation

Combining Spatial and Temporal Processing
Sensor Data (Mobile, Cars, Devices, Land) in IBM Cloud Object Storage, processed by SQL Query:
• GPS data with SQL/MM: Location Filtering, Spatial Aggregation, Location Analytics
• Sensor Metrics over time (t) with Timeseries SQL: Timeseries Assembly, Timeseries Join

The SQL Sandwich
• Object Storage (Raw Data): SQL for Explore, Prepare & Batch Analytics
• SQL ETL
• Data Warehouse (High Quality Data): SQL for Interactive Analytics with SLAs
• SQL ETL
• Object Storage (Archived Data): SQL for Compliance Reporting
• SQL Federation
Blog Article: SQL Sandwich

Promoting Data After Preparation
SELECT …
INTO <COS URI> <format & layout ops> |
<Db2 service CRN> | <Db2 database URI> /<table name>
[CREATE | OVERWRITE | APPEND] [PARALLELISM <num>]
COS URI: e.g. cos://us-south/myBucket/myFolder/myData.parquet
COS Format/Layout: e.g. STORED AS PARQUET PARTITIONED BY (city, date)
Db2 options:
PARALLELISM: Number of parallel threads for writing (default 1)
Examples:
… INTO db2://db2w-dja.us-south.db2w.cloud.ibm.com/MYSCHEMA.MYTABLE PARALLELISM 20
… INTO crn:v1:bluemix:public:dashdb-for-tx:us-south:s/c38…:cf-service-instance:/MYTABLE
* future
Promote on COS
Promote to Db2
Blog Article: Db2 ETL
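The INTO grammar above can be assembled programmatically when promotion statements are generated by a pipeline. This is a minimal sketch that only builds the clause text from the options shown on the slide. The helper `build_into_clause` is hypothetical, performs no validation against the service, and does not submit anything.

```python
from typing import Optional, Sequence


def build_into_clause(
    target: str,
    stored_as: Optional[str] = None,
    partitioned_by: Optional[Sequence[str]] = None,
    parallelism: Optional[int] = None,
) -> str:
    """Compose an INTO clause following the grammar shown on the slide."""
    parts = [f"INTO {target}"]
    if target.startswith("cos://"):
        # COS targets take format and layout options; parallelism is ignored.
        if stored_as:
            parts.append(f"STORED AS {stored_as.upper()}")
        if partitioned_by:
            parts.append(f"PARTITIONED BY ({', '.join(partitioned_by)})")
    elif parallelism is not None:
        # Db2 targets (database URI or service CRN) take PARALLELISM.
        if parallelism < 1:
            raise ValueError("parallelism must be at least 1")
        parts.append(f"PARALLELISM {parallelism}")
    return " ".join(parts)


if __name__ == "__main__":
    # Both targets are the examples given on the slide.
    print(build_into_clause(
        "cos://us-south/myBucket/myFolder/myData.parquet",
        stored_as="parquet",
        partitioned_by=["city", "date"],
    ))
    print(build_into_clause(
        "db2://db2w-dja.us-south.db2w.cloud.ibm.com/MYSCHEMA.MYTABLE",
        parallelism=20,
    ))
```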

What supported formats are analytics friendly?
Blog Article: Data Layout

Data Skipping in IBM SQL Query
• Avoid reading irrelevant objects using indexes
• Complements partition pruning -> object level pruning
• Stores aggregate metadata per object to enable skipping decisions
• Indexes are stored in COS
• Supports multiple index types
  – Currently MinMax, ValueList, BloomFilter, Geospatial
• Underlying data skipping library is extensible
  – New index types can easily be supported
• Enables data skipping on SQL UDFs
  – e.g. ST_Contains, ST_Distance etc.
  – UDFs are mapped to indexes

How Data Skipping Works
Spark SQL Query Execution Flow (uses Catalyst optimizer and session extensions API)
• Without data skipping: Query -> Prune partitions -> Read data
• With data skipping: Query -> Prune partitions -> Optional file filter (Metadata Filter) -> Read data

Data Skipping Example
Example Query
SELECT *
FROM cos://us-geo/twc/Weather STORED AS parquet
WHERE temp > 40
Data, Object Listing: Weather/dt=2020-08-17/part-00085.parquet
Metadata
Object Name                 Temp Min   Temp Max
...
dt=2020-08-17/part-00085    7.97       26.77
dt=2020-08-17/part-00086    2.45       23.71
dt=2020-08-17/part-00087    6.46       18.62
dt=2020-08-17/part-00088    23.67      41.02
...
Red objects are not relevant to this query: with temp > 40, these are the objects whose Temp Max is at most 40 (part-00085, part-00086 and part-00087 among the rows shown).
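The skipping decision for this example can be sketched in a few lines. This is an illustration of the MinMax idea only, not the service's implementation: it assumes the index is available as in-memory rows, and the names `MinMax` and `objects_to_read` are hypothetical. The metadata values are copied from the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MinMax:
    object_name: str
    temp_min: float
    temp_max: float


# Metadata rows copied from the slide.
INDEX = [
    MinMax("dt=2020-08-17/part-00085", 7.97, 26.77),
    MinMax("dt=2020-08-17/part-00086", 2.45, 23.71),
    MinMax("dt=2020-08-17/part-00087", 6.46, 18.62),
    MinMax("dt=2020-08-17/part-00088", 23.67, 41.02),
]


def objects_to_read(index: List[MinMax], threshold: float) -> List[str]:
    """Objects that may contain rows with temp > threshold.

    An object can be skipped safely when even its maximum does not exceed
    the threshold, because then no row in it can satisfy the predicate.
    """
    return [m.object_name for m in index if m.temp_max > threshold]


if __name__ == "__main__":
    print(objects_to_read(INDEX, 40))  # ['dt=2020-08-17/part-00088']
```

The check is a necessary condition, not a sufficient one: part-00088 is still read in full and filtered row by row.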

Geospatial Data Skipping Example
Example Query
SELECT * FROM Weather STORED AS parquet
WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
INTO cos://us-south/results STORED AS parquet
The polygon is the Raleigh Research Triangle (US).
Map ST_Contains UDF to necessary conditions on lat, long
Metadata
Object Name                 lat Min   lat Max
...
dt=2020-08-17/part-00085    35.02     36.17
dt=2020-08-17/part-00086    43.59     44.95
dt=2020-08-17/part-00087    34.86     40.62
dt=2020-08-17/part-00088    23.67     25.92
...
Red objects are not relevant to this query: these are the objects whose lat range does not overlap the polygon's latitudes 35.78 to 36.00 (part-00086 and part-00088 among the rows shown).
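Mapping ST_Contains to necessary conditions can be illustrated with a bounding-box check. This is a minimal sketch, assuming the necessary condition is an overlap between an object's lat range and the polygon's latitude extent. Only the lat metadata shown on the slide is used, and the long check would work the same way. All function and variable names are hypothetical.

```python
from typing import List, Tuple

# Polygon vertices from the example query, as (long, lat) pairs in WKT order.
POLYGON = [(-78.93, 36.00), (-78.67, 35.78), (-79.04, 35.90), (-78.93, 36.00)]

# (object name, lat min, lat max) rows copied from the slide.
LAT_INDEX = [
    ("dt=2020-08-17/part-00085", 35.02, 36.17),
    ("dt=2020-08-17/part-00086", 43.59, 44.95),
    ("dt=2020-08-17/part-00087", 34.86, 40.62),
    ("dt=2020-08-17/part-00088", 23.67, 25.92),
]


def lat_extent(polygon: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Smallest and largest latitude of the polygon's vertices."""
    lats = [lat for _, lat in polygon]
    return min(lats), max(lats)


def candidate_objects(index, polygon) -> List[str]:
    """Objects whose lat range overlaps the polygon's latitude extent.

    A point inside the polygon must lie within the extent, so objects
    without any overlap cannot contribute rows and can be skipped.
    """
    lo, hi = lat_extent(polygon)
    return [name for name, lat_min, lat_max in index
            if lat_min <= hi and lat_max >= lo]


if __name__ == "__main__":
    print(candidate_objects(LAT_INDEX, POLYGON))
    # ['dt=2020-08-17/part-00085', 'dt=2020-08-17/part-00087']
```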

10x Acceleration with Data Skipping and Catalog
• 10x speedup on average
• Query rewrite approach (yellow) is the baseline
• Using already optimized data format: Parquet/ORC
• For other formats the acceleration is much larger, e.g. CSV/JSON/Avro
• Experiment uses Raleigh Research Triangle query

An Idealized Enterprise Data Lake Topology
Diagram labels:
• Sources: Systems of Record, LoB DBs, Streaming Topics (Scheduled ETL or CDC)
• Zones: Landing Zone, Prep Zone, Enterprise Zone, Archive Zone, EDW
• LoB Data Lake Projects: Landing Subscription, LoB Prep Zone, LoB Analytic Zone, Publish
• Consumers: Analytic Apps
• Legend: Automatic, On Demand, Read only
IBM Confidential

The COVID-19 Data Lake
Making trusted COVID-19 data available to a broad set of analytics, e.g.:
• https://accelerator.weather.com/bi
• Watson Health Return to Work Advisor
Properties:
• Easily extensible with new data sources
• Maximized velocity and elasticity
• Full automation of all pipelines
• New pipeline prototype in hours & productize in 2-3 days
• Radically minimizing resource and operational costs by using IBM Cloud serverless and full ops automation
Services and their roles (diagram labels):
• Cloud Object Storage: Persist, Trigger
• Watson Studio: Static Content Creation, Schema Management, Pipeline PoCs, Usage Tutorials
• SQL Query: Transformation, Transport, Table Catalog (Mart), Queries, Export
• Cloud Functions: Pipeline Productization, Automation, Monitoring & Alerting, Pull External Data

The Four P of Serverless Pipelines
Diagram labels:
• Python libraries: ibmcloudsql (Data Operations), ibm_boto3 (COS Object Operations)
• Prototype: Create Notebooks (Watson Studio Notebooks)
• PoC: Schedule Notebooks (Watson Studio Notebooks)
• Productize: Deploy Cloud Functions

COVID-19 Data Lake Topology – High Level
Diagram labels, grouped:
• Landing Zone (E): Landing Buckets, Landing Namespace, Collectors Sequences
• Preparation Zone (T): Preparation Buckets, Preparation Namespace, Preparation Sequences
• Integration Zone (L): Integration Buckets, Integration Namespace, Mart Sequences, Delivery Sequences
• Data Mart Instance: Table Catalog, Location Statistics, Reference Data
• Pipeline Instances: Schema Management, Static Content Management
• Projects: Pipeline PoC Project (Preliminary Pipeline Notebooks), Mart Management Project, Data Mart Access Project (Usage Notebooks)
• Sources and consumers: External Data Sources (Pull, Push), TWC Scrapers & Pipeline, Users, Dashboarding (COGNOS), DWH
• Operations: Upload, Update, Add Partitions, Transform, Query & Extract

COVID-19 Data Mart – Data Model
Tables (the slide's legend distinguishes Fact Tables and Dimension Tables):
• COUNTY_STATISTICS: county_id, dt, collected, attribution_url, confirmed_cases (& *_delta), deaths (& *_delta), hospitalized (& *_delta), testsperformed (& *_delta), recovered (& *_delta)
• PROVINCE_STATISTICS: province_id, dt, collected, attribution_url, deaths (& *_delta)
• COUNTRY_STATISTICS: country_id, dt, collected, attribution_url, deaths (& *_delta)
• ECDC_STATISTICS: country_id, dt, collected, attribution_url, confirmed_cases, deaths, confirmed_cases_delta, deaths_delta
• WHO_STATISTICS: country_id, dt, collected, attribution_url, confirmed_cases, deaths, confirmed_cases_delta, deaths_delta
• COUNTIES: county_id, county_name, country_id, province_id, code_type, fips_code, nuts_code
• PROVINCES: province_id, province_name, country_id, code_type, fips_code, nuts_code
• COUNTRIES: country_id, country_name, code_type
• GEOGRAPHIC_FULL: geo_id, geolevel, region, lat, lon, geometry_wkt, attribution, attribution_url
• US_DEMOGRAPHIC: geo_id, geolevel, total_population, male, ..., 35_to_40_years, ..., attribution, attribution_url
• EU_DEMOGRAPHIC: geo_id, geolevel, sex_total, sex_y15_29, ..., sex_f_age_total, ..., attribution, attribution_url
• WORLD_DEMOGRAPHIC: geo_id, geolevel, population, migrants_net, ..., attribution, attribution_url
• MX_DEMOGRAPHIC: geo_id, geolevel, population, attribution, attribution_url
Views:
• GEOGRAPHIC (view): substr(geometry_wkt, 1, 30)
• WORLD_GEOGRAPHIC (view): country_id=“WORLD”
• EU_GEOGRAPHIC (view): country_id=“EU”
• US_GEOGRAPHIC (view): country_id=“US”
• WORLD_GEOGRAPHIC_FULL (view): country_id=“WORLD”
• EU_GEOGRAPHIC_FULL (view): country_id=“EU”
• US_GEOGRAPHIC_FULL (view): country_id=“US”
• US_COUNTIES (view): country_id=“US”
• US_PROVINCES (view): country_id=“US”

IBM Cloud – PaaS Context for Data Lake
Cloud Pak for Data as a Service
Ladder to AI: Collect, Organize, Analyze, Infuse
IBM Cloud Data Lake (diagram labels: Telemetry Data, IBM Cloud Databases, Streaming Ingest, Explore, ETL, Prepare, Enrich, Optimize, Query)
Services and their roles (diagram labels):
• Data lake services: IBM SQL Query, IBM Analytics Engine, Cloud Object Storage, IBM Event Streams, IBM DataStage
• Dashboarding: Cognos Analytics
• AI: Watson Machine Learning, Watson OpenScale
• Data Science: Watson Studio
• Govern: Watson Knowledge Catalog
• Protect: Key Protect
• Automate: Cloud Functions
• Operate: logDNA, sysdig
• Databases: IBM Cloud Databases, Db2 Warehouse
• Data Virtualization
Further labels: Ingest, Train

Serverless == Self-Service == Empowerment:
Data Producers becoming Data Product Owners
• Enable data producers to prepare and serve data for analytic consumption
• Data Lake building blocks are easy-to-use and easy-to-automate services
• Minimize data lake operations and resource cost overhead
• Objective:
  – Eliminate all hurdles (and excuses) for data producers NOT to serve their data for analytics
  – Reduce classical data engineers to the role of data lake infrastructure providers
• Reference:
  – Paradigm Shift to Data Mesh: https://martinfowler.com/articles/data-monolith-to-mesh.html

IBM SQL Query – Timeseries SQL 1/2
• Intuitive first-of-a-kind SQL extensions for timeseries operations
• Industry leading differentiators, including:
  – Timeseries transformation functions: correlation, Fourier transformation, z-normalization, Granger, interpolation, and distances
  – Temporal Joins: SQL support for Left/Right/Full Inner and Outer joins of multiple timeseries
Alignment & Joining:
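The semantics of joining two timeseries on their timestamps can be sketched in plain Python. This is an illustration of the concept only: the slide does not show the Timeseries SQL function names, so the functions below are hypothetical and assume each series is a list of (timestamp, value) pairs with unique timestamps.

```python
from typing import List, Optional, Tuple

Series = List[Tuple[int, float]]  # (timestamp, value), timestamps unique


def inner_join(left: Series, right: Series) -> List[Tuple[int, float, float]]:
    """Keep only timestamps that occur in both series."""
    right_by_t = dict(right)
    return [(t, v, right_by_t[t]) for t, v in left if t in right_by_t]


def left_outer_join(left: Series, right: Series) -> List[Tuple[int, float, Optional[float]]]:
    """Keep every timestamp of the left series; missing right values become None."""
    right_by_t = dict(right)
    return [(t, v, right_by_t.get(t)) for t, v in left]


if __name__ == "__main__":
    temperature = [(1, 20.0), (2, 21.5), (3, 22.0)]
    humidity = [(2, 0.40), (3, 0.42), (4, 0.45)]
    print(inner_join(temperature, humidity))
    # [(2, 21.5, 0.4), (3, 22.0, 0.42)]
    print(left_outer_join(temperature, humidity))
    # [(1, 20.0, None), (2, 21.5, 0.4), (3, 22.0, 0.42)]
```

Right and full outer joins follow the same pattern with the roles swapped or the timestamp sets united.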

IBM SQL Query – Timeseries SQL 2/2
• Further industry leading differentiators
  – Numerical and categorical timeseries types
  – Timeseries data skipping for fast queries
  – Forecasting: ARIMA, BATS, Anomaly detection, etc.
  – Subsequence Mining: Train & match models for event sequences
  – Segmentation: Time-based, Record-based, Anchor-based, Burst, and Silence
Segmentation:
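Time-based segmentation, the first variant listed above, can be sketched as follows. This is a plain-Python illustration under the assumption that a segment is a fixed-length time window counted from the earliest timestamp. The function name `segment_by_time` is hypothetical and is not the Timeseries SQL syntax.

```python
from typing import Dict, List, Tuple

Series = List[Tuple[int, float]]  # (timestamp, value)


def segment_by_time(series: Series, window: int) -> List[Series]:
    """Split a series into consecutive windows of `window` time units.

    Windows that contain no observations are omitted from the result.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not series:
        return []
    ordered = sorted(series)
    start = ordered[0][0]
    buckets: Dict[int, Series] = {}
    for t, v in ordered:
        buckets.setdefault((t - start) // window, []).append((t, v))
    return [buckets[k] for k in sorted(buckets)]


if __name__ == "__main__":
    readings = [(0, 1.0), (5, 2.0), (10, 3.0), (12, 4.0), (31, 5.0)]
    for segment in segment_by_time(readings, window=10):
        print(segment)
    # [(0, 1.0), (5, 2.0)]
    # [(10, 3.0), (12, 4.0)]
    # [(31, 5.0)]
```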

IBM SQL Query – Spatial SQL
• SQL/MM standard to store & analyze spatial data in RDBMS
• Migration of PostGIS compliant SQL queries
• Aggregation, computation and join via native SQL syntax
• Industry leading differentiators
  – Geodetic Full Earth support
  – Increased developer productivity
  – Avoid piece-wise planar projections
  – High precision calculations anywhere on the earth
  – Very large polygons (e.g. countries), polar caps, crossing the anti-meridian
  – Spatial data skipping for fast queries
  – Native and fine-granular geohash support
  – Fast spatial aggregation
SQL Query Scale Out Architecture
Diagram labels:
• Request Queue, Cluster Pool and Kernel Pools, with kernels reached via JKG (Web Sockets)
• Data Center 1, Data Center 2, Data Center 3: Analytics Engine Cluster with nodes (Node 1, Node 2, Node 3, …) and pools of 20 Kernels
• SQL jobs (SQL 1, SQL 2, SQL 3, SQL 4, SQL 5, SQL 6, …) placed on kernels across the data centers
• Cloud Object Storage
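How a request queue can spread SQL jobs over per-data-center kernel pools is sketched below. The placement policy (always pick the pool with the most free kernels) is an assumption made for illustration, as the slide does not state how the service dispatches. All names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def dispatch(jobs: Iterable[str], free_kernels: Dict[str, int]) -> Tuple[Dict[str, str], List[str]]:
    """Assign queued jobs to data centers in FIFO order.

    Returns the placement (job -> data center) and the jobs still queued
    because every kernel is busy.
    """
    queue = deque(jobs)
    free = dict(free_kernels)  # work on a copy, keep the caller's dict intact
    placement: Dict[str, str] = {}
    while queue:
        target = max(free, key=free.get)  # assumed policy: most free kernels
        if free[target] == 0:
            break  # all pools exhausted; remaining jobs wait in the queue
        placement[queue.popleft()] = target
        free[target] -= 1
    return placement, list(queue)


if __name__ == "__main__":
    pools = {"Data Center 1": 20, "Data Center 2": 20, "Data Center 3": 20}
    placed, waiting = dispatch([f"SQL {i}" for i in range(1, 7)], pools)
    for job, dc in placed.items():
        print(job, "->", dc)
    print("still queued:", waiting)
```

Spreading jobs over several data centers is also what keeps the service available when one of them fails.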

Secure Passing of Custom Data Source Credentials
Participants: User, IBM Key Protect, SQL Query, Data Sources
1. Create User/Password combination or API Key
2. Store password or API Key base64-encoded as custom key
3. Submit SQL statement referencing password or API Key via Key Protect CRN
4. Securely retrieve password or API Key
5. Connect with retrieved User/Password combination or API Key
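Step 2 asks for the secret in base64-encoded form. This is a minimal sketch of that encoding step with Python's standard library; storing the result as a custom key in Key Protect is not shown. The helper names and the sample secret are hypothetical.

```python
import base64


def encode_secret(secret: str) -> str:
    """Base64-encode a password or API key for storage as a custom key."""
    return base64.b64encode(secret.encode("utf-8")).decode("ascii")


def decode_secret(encoded: str) -> str:
    """Reverse of encode_secret, as done after the secret is retrieved."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    encoded = encode_secret("my-example-api-key")  # placeholder value, not a real key
    print(encoded)
    assert decode_secret(encoded) == "my-example-api-key"
```

Base64 is an encoding, not encryption. In this flow the protection comes from Key Protect, and the SQL statement only carries the CRN of the key.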

Thank you
Torsten Steinbach, Data Lake Services Architect, IBM Cloud, IBM
Resources:
– SQL Query Documentation: https://cloud.ibm.com/docs/sql-query?topic=sql-query-overview
– SQL Query Tutorial: https://dataplatform.cloud.ibm.com/exchange/public/entry/view/4a9bb1c816fb1e0f31fec5d580e4e14d
– SQL Cloud Function: https://hub.docker.com/r/ibmfunctions/sqlquery/
– COVID-19 Data Lake Presentation: https://ibm.biz/Bdq5Ys
– THINK 2020 Cloud Data Lake Presentation: https://ibm.biz/Bdq5Yi
– IBM Cloud Data Lake Team: #wdp-sql-service & #sqlquery-support on Slack
– Blogs:
• https://www.ibm.com/cloud/blog/new-builders/data-lakes-in-the-cloud
• https://www.ibm.com/cloud/blog/big-data-layout
• https://www.ibm.com/cloud/blog/new-builders/big-data
• https://www.ibm.com/cloud/blog/sql-databases-and-object-storage
• https://www.ibm.com/cloud/blog/accelerate-your-big-data-analytics-and-reduce-costs-by-using-ibm-cloud-sql-query
• https://www.ibm.com/cloud/blog/a-serverless-attack-on-ugly-log-archives
• https://www.ibm.com/cloud/blog/announcements/automate-serverless-data-pipelines-for-your-data-warehouse-or-data-lakes
