IBM Cloud Day 2021
Deep Dive into Cloud Native Data Lakes with IBM Cloud
Torsten Steinbach, IBM Cloud Data Lake Architect
James Bennett, IBM Cloud Data Lake Offering Lead
21st Jan 2021
Content
• Architecture
• Data Skipping
• COVID-19 Data Lake
• Interactive Demo
IBM Cloud Data Lake
Architecture
Telemetry Data
Explore
ETL
Prep Enrich
Streaming
Optimize Analyze
✓ Seamless Elasticity
✓ Seamless Scalability
✓ Highly Cost Effective
✓ Long Term Retention
✓ Any data formats
ETL
IBM Cloud Data Lake – Big Picture
DWH
Databases
✓ Response Time SLAs
✓ Warm High-quality Data only
Cloud Data Lake
Analytics
Optional:
IBM Serverless Stack for Analytics
Serverless Storage (Object Storage): Only pay for the volume of data that you actually store
Serverless Runtimes (Cloud Functions): Only pay for the CPU that you actually consume
Serverless Analytics (SQL Query): Only pay for the amount of data that you actually scan
Blog Article
• Properties of Serverless:
– No management of resources, hosts and processes
– Auto-scaling and auto-provisioning based on actual load
– Precise billing based on actually consumed system resources (memory, storage, CPU, network, I/O)
– High availability is always implicit
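The pay-per-scan idea can be made concrete with a little arithmetic. The sketch below assumes the $5/TB rate that the deck quotes for SQL Query and a decimal terabyte; the function name and the data volumes are purely illustrative.

```python
TB = 10**12  # assumption: decimal terabyte; the deck does not say TB vs. TiB


def scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Cost of one query when billing depends only on the bytes scanned."""
    return bytes_scanned / TB * usd_per_tb


# Hypothetical volumes: a full scan of 2 TB versus 200 GB after pruning.
full_scan = scan_cost_usd(2 * TB)
pruned_scan = scan_cost_usd(200 * 10**9)
print(f"full scan: ${full_scan:.2f}, pruned scan: ${pruned_scan:.2f}")
```

Under this billing model, anything that reduces the bytes read (partitioning, columnar formats, data skipping) lowers the bill proportionally.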
IBM SQL Query – The Central Cloud Data Lake Service
Cloud Data
Data
Transformation
Serverless SQL Query Service
Analytics
Object
Storage RDBMS
+
Developers
Data
Engineers
Data Analysts
✓ Supports ad-hoc and unknown data structures
✓ ETL & ELT Support
✓ 100% Pay-as-you-go ($5/TB)
✓ 100% API enabled
✓ Automatic Big Data Scale-Out with Spark
✓ 100% Self service, No Setup
Data
Management
+
Data Scientists
✓ Built-In Database Catalog & Data Skipping
Data Ingestion
+
IBM SQL Query Architecture
2. Read data
4. Read
results
Application
3. Write data
Cloud Data Services
1. Submit SQL
SQL
Event Streams
Query
Db2 on Cloud
Geospatial SQL
Data Skipping
Timeseries SQL
Hive Metastore
Video
Cloud Object Storage
• Using IBM Analytics Engine service (Spark clusters as a service)
• Large farm of Spark clusters auto-provisioned & auto-managed in the background
• Managing a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway)
• SQL grammar sandbox
• Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation
• Intrinsically HA (dispatching across Spark environments in each availability zone)
IBM SQL Query – Access Patterns
Create
Query
SQL
Console
Watson
Studio
Notebooks
Cloud Functions
Integrate Explore
Deploy
Python SDK
REST API
JDBC
Object
Store
Console
Event
Streams
Console
Meta Data
IBM Cloud Data Lake – Meta Data
Cloud Data
ACID
Spark
Data Skipping Indexes Governance Policies
& Lineage
Schema, Partitioning,
Statistics
Serverless SQL
Object
Storage RDBMS
Hive
Metastore
Kafka Schema
Registry
Xskipper Iceberg
Watson Knowledge
Catalog
Delta Lake
Event Streams SQL Query
Object
Storage Meta Data
Integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg)
Real-Time
Queries
IBM Cloud Data Lake – 2021 Architecture
COS
Batch
Queries
Stream Xform
& Joins
Stream data landing
Schema management & enforcement
ETL & Data
Preparation
Combining Spatial and Temporal Processing
IBM Cloud Object Storage
Sensor
Data
Query
Location
Analytics
Mobile
Cars
Devices
Land
Location
Filtering
Spatial
Aggregation
GPS
SQL/MM
Sensor
Metrics
t
t
t
Timeseries
Assembly
Timeseries
Join
Timeseries SQL
t
The SQL Sandwich
Object Storage
Object Storage
Data Warehouse
Raw Data
High Quality
Data
Archived Data
SQL ETL
SQL ETL
SQL
Federation
Explore, Prepare &
Batch Analytics
Interactive
Analytics with SLAs
Compliance
Reporting
SQL
SQL
SQL
Blog Article:
SQL Sandwich
Promoting Data After Preparation
SELECT …
INTO <COS URI> <format & layout ops> |
<Db2 service CRN> | <Db2 database URI> /<table name>
[CREATE | OVERWRITE | APPEND] [PARALLELISM <num>]
COS URI: e.g. cos://us-south/myBucket/myFolder/myData.parquet
COS Format/Layout: e.g. STORED AS PARQUET PARTITIONED BY (city, date)
Db2 options:
PARALLELISM: Number of parallel threads for writing (default 1)
Examples:
… INTO db2://db2w-dja.us-south.db2w.cloud.ibm.com/MYSCHEMA.MYTABLE PARALLELISM 20
… INTO crn:v1:bluemix:public:dashdb-for-tx:us-south:s/c38…:cf-service-instance:/MYTABLE
* future
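To make the grammar above easier to picture, here is a small sketch that assembles the `INTO` clause for the two kinds of targets shown on the slide. The helper `build_into_clause` and its parameters are hypothetical and not part of any IBM SDK; it only concatenates the keywords listed above and does not validate the result against the service.

```python
from typing import Optional, Sequence


def build_into_clause(target: str,
                      stored_as: Optional[str] = None,
                      partitioned_by: Sequence[str] = (),
                      parallelism: Optional[int] = None) -> str:
    """Assemble an INTO clause from the options named on the slide (hypothetical helper)."""
    parts = [f"INTO {target}"]
    if target.startswith("cos://"):
        # COS targets take format and layout options.
        if stored_as:
            parts.append(f"STORED AS {stored_as.upper()}")
        if partitioned_by:
            parts.append(f"PARTITIONED BY ({', '.join(partitioned_by)})")
    elif parallelism is not None:
        # Db2 targets (db2:// URI or service CRN) take the PARALLELISM option.
        parts.append(f"PARALLELISM {parallelism}")
    return " ".join(parts)


cos_clause = build_into_clause(
    "cos://us-south/myBucket/myFolder/myData.parquet",
    stored_as="parquet",
    partitioned_by=["city", "date"],
)
db2_clause = build_into_clause(
    "db2://db2w-dja.us-south.db2w.cloud.ibm.com/MYSCHEMA.MYTABLE",
    parallelism=20,
)
print(cos_clause)
print(db2_clause)
```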
Promote
on COS
Promote
to Db2
Blog Article:
Db2 ETL
Data Skipping
What supported formats are analytics friendly?
Blog Article:
Data Layout
Data Skipping in IBM SQL Query
• Avoid reading irrelevant objects using indexes
• Complements partition pruning -> object level pruning
• Stores aggregate metadata per object to enable skipping decisions
• Indexes are stored in COS
• Supports multiple index types
  • Currently MinMax, ValueList, BloomFilter, Geospatial
• Underlying data skipping library is extensible
  • New index types can easily be supported
• Enables data skipping on SQL UDFs
  • e.g. ST_Contains, ST_Distance etc.
  • UDFs are mapped to indexes
How Data Skipping Works
Spark SQL Query Execution Flow
Uses Catalyst optimizer and
session extensions API
Query
Prune
partitions
Read data
Query
Prune
partitions
Optional
file filter
Read data
Metadata
Filter
Data Skipping Example
Weather/dt=2020-08-17/part-00085.parquet
Weather/dt=2020-08-17/part-00086.parquet
Weather/dt=2020-08-17/part-00087.parquet
Weather/dt=2020-08-17/part-00088.parquet
Weather/dt=2020-08-18/part-00001.parquet
Weather/dt=2020-08-18/part-00002.parquet
Data
Object Listing
Example Query
SELECT *
FROM cos://us-geo/twc/Weather STORED AS parquet
WHERE temp > 40
Metadata
Object Name                 Temp Min   Temp Max
...
dt=2020-08-17/part-00085    7.97       26.77
dt=2020-08-17/part-00086    2.45       23.71
dt=2020-08-17/part-00087    6.46       18.62
dt=2020-08-17/part-00088    23.67      41.02
...
Objects whose Temp Max is not above 40 (part-00085, part-00086 and part-00087, shown in red on the slide) are not relevant to this query
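The skipping decision in this example can be reproduced in a few lines. This is a minimal sketch of the MinMax idea using the metadata values from the slide; the class and function names are illustrative and do not reflect the API of the actual data skipping library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinMaxEntry:
    object_name: str
    temp_min: float
    temp_max: float


# Per-object metadata, copied from the slide.
INDEX = [
    MinMaxEntry("dt=2020-08-17/part-00085", 7.97, 26.77),
    MinMaxEntry("dt=2020-08-17/part-00086", 2.45, 23.71),
    MinMaxEntry("dt=2020-08-17/part-00087", 6.46, 18.62),
    MinMaxEntry("dt=2020-08-17/part-00088", 23.67, 41.02),
]


def objects_to_read(index, threshold):
    """Keep only objects that may contain a row with temp > threshold."""
    return [e.object_name for e in index if e.temp_max > threshold]


print(objects_to_read(INDEX, 40))
```

Only `part-00088` survives the filter, so the other three objects never have to be fetched from object storage.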
Geospatial Data Skipping Example
Example Query
SELECT * FROM Weather STORED AS parquet
WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat))
INTO cos://us-south/results STORED AS parquet
Metadata
Object Name                 lat Min    lat Max
...
dt=2020-08-17/part-00085    35.02      36.17
dt=2020-08-17/part-00086    43.59      44.95
dt=2020-08-17/part-00087    34.86      40.62
dt=2020-08-17/part-00088    23.67      25.92
...
Objects whose lat range does not overlap the polygon's latitudes (part-00086 and part-00088, shown in red on the slide) are not relevant to this query
Raleigh Research Triangle (US)
Map the ST_Contains UDF to necessary conditions on lat, long
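One way to read "necessary conditions on lat, long" is as a bounding-box test: a point can only lie inside the polygon if its latitude falls within the polygon's latitude range. The sketch below applies that test to the lat metadata from the slide. It is an illustration of the idea under that assumption, not the library's actual rewrite logic, and all names are hypothetical.

```python
# Polygon vertices as (long, lat), taken from the example query.
POLYGON = [(-78.93, 36.00), (-78.67, 35.78), (-79.04, 35.90), (-78.93, 36.00)]

# Per-object (lat_min, lat_max) metadata, copied from the slide.
LAT_INDEX = {
    "dt=2020-08-17/part-00085": (35.02, 36.17),
    "dt=2020-08-17/part-00086": (43.59, 44.95),
    "dt=2020-08-17/part-00087": (34.86, 40.62),
    "dt=2020-08-17/part-00088": (23.67, 25.92),
}


def lat_bounds(polygon):
    """Latitude range covered by the polygon's bounding box."""
    lats = [lat for _, lat in polygon]
    return min(lats), max(lats)


def may_contain_match(lat_min, lat_max, polygon):
    """Necessary condition: the object's lat range overlaps the polygon's lat range."""
    poly_min, poly_max = lat_bounds(polygon)
    return lat_max >= poly_min and lat_min <= poly_max


relevant = [name for name, (lo, hi) in LAT_INDEX.items()
            if may_contain_match(lo, hi, POLYGON)]
print(relevant)
```

The test is conservative: objects that pass may still contain no matching rows, but objects that fail it cannot contain any, so skipping them is safe. A full version would apply the same check to the longitude column.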
10x Acceleration with Data Skipping and Catalog
Query rewrite approach (yellow) is the baseline
• Using already optimized data format: Parquet/ORC
For other formats the acceleration is much larger
• e.g. CSV/JSON/Avro
Experiment uses Raleigh Research Triangle query
10x speedup on average
IBM’s COVID-19
Data Lake
An Idealized Enterprise Data Lake Topology
Systems
of Record
Systems
of Record
Streaming
Topics
Streaming
Topics
LoB Data Lake Projects
LoB Data Lake Projects
LoB Data Lake Projects
Landing
Zone
Landing
Subscription
Prep
Zone
Enterprise
Zone
LoB Prep
Zone
Archive
Zone
EDW
Systems
of Record LoB DBs
Streaming
Topics
Scheduled
ETL or CDC
LoB Analytic
Zone
Publish
Analytic
Apps
Analytic
Apps
Analytic
Apps
Analytic
Apps
Automatic
On Demand
Read only
IBM Confidential
Making trusted COVID-19 data available to a broad set of analytics, e.g.:
• https://accelerator.weather.com/bi
• Watson Health Return to Work Advisor
The COVID-19 Data Lake
• Easily extensible with new data sources
• Maximized velocity and elasticity
• Full automation of all pipelines
• New pipeline prototype in hours, productized in 2-3 days
• Radically minimized resource and operational costs by using IBM Cloud serverless and full ops automation
Cloud Functions
Cloud
Object Storage
- Persist
- Trigger
- Static Content Creation
- Schema Management
- Pipeline PoCs
- Usage Tutorials
Watson Studio
SQL Query
- Transformation
- Transport
- Table Catalog (Mart)
- Queries
- Export
- Pipeline Productization
- Automation
- Monitoring & Alerting
- Pull External Data
The Four Ps of Serverless Pipelines
Data
Operations
COS
Object
Operations
Prototype
ibm_boto3
Python
COS
ibmcloudsql
Create
Notebooks
PoC
Schedule
Notebooks
Deploy Cloud
Functions
Productize
Watson Studio
Notebooks
Watson Studio
Notebooks
COVID-19 Data Lake Topology – High Level
Landing Zone (E)
Landing Buckets
Preparation Zone (T)
Landing Namespace
Preparation
Namespace
Preparation Buckets
Integration Zone (L)
Dashboarding
DWH
Integration Buckets
Data Mart Instance
Integration
Namespace
Mart Management
Project
Data Mart Access
Project
TWC Scrapers & Pipeline
Collectors Sequences
Preparation Sequences
Mart Sequences
Delivery Sequences
Pipeline Instance
Schema
Management
Static Content
Management
Pipeline Instance
Usage Notebooks
Table Catalog
Preparation Sequences
External
Data
Sources
Pull
Push
Collectors Sequences
Preparation Sequences
Usage Notebooks
Usage Notebooks
Users
Pipeline PoC Project
Preliminary Pipeline
Notebooks
Location
Statistics
Upload
Update
Reference
Data
Add
Partitions
Query &
Extract
Transform
COGNOS
US_DEMOGRAPHIC
geo_id
geolevel
total_population
male
:
35_to_40_years
:
attribution
attribution_url
COVID-19 Data Mart –
Data Model
COUNTY_STATISTICS
county_id
dt
collected
attribution_url
confirmed_cases (& *_delta)
deaths(& *_delta)
hospitalized (& *_delta)
testsperformed (& *_delta)
recovered (& *_delta)
COUNTIES
county_id
county_name
country_id
province_id
code_type
fips_code
nuts_code
EU_DEMOGRAPHIC
geo_id
geolevel
sex_total
sex_y15_29
:
sex_f_age_total
:
attribution
attribution_url
WORLD_DEMOGRAPHIC
geo_id
geolevel
population
migrants_net
:
attribution
attribution_url
PROVINCE_STATISTICS
province_id
dt
collected
attribution_url
confirmed_cases (& *_delta)
deaths(& *_delta)
hospitalized (& *_delta)
testsperformed (& *_delta)
recovered (& *_delta)
COUNTRY_STATISTICS
country_id
dt
collected
attribution_url
confirmed_cases (& *_delta)
deaths(& *_delta)
hospitalized (& *_delta)
testsperformed (& *_delta)
recovered (& *_delta)
PROVINCES
province_id
province_name
country_id
code_type
fips_code
nuts_code
GEOGRAPHIC_FULL
geo_id
geolevel
region
lat
lon
geometry_wkt
attribution
attribution_url
COUNTRIES
country_id
country_name
code_type
Fact Tables
Dimension Tables
WORLD_GEOGRAPHIC (view)
country_id=“WORLD”
EU_GEOGRAPHIC (view)
country_id=“EU”
US_GEOGRAPHIC (view)
country_id=“US”
US_COUNTIES (view)
country_id=“US”
GEOGRAPHIC (view)
substr(geometry_wkt, 1, 30)
WORLD_GEOGRAPHIC_FULL (view)
country_id=“WORLD”
EU_GEOGRAPHIC_FULL (view)
country_id=“EU”
US_GEOGRAPHIC_FULL (view)
country_id=“US”
US_PROVINCES (view)
country_id=“US”
Views
ECDC_STATISTICS
country_id
dt
collected
attribution_url
confirmed_cases
deaths
confirmed_cases_delta
deaths_delta
WHO_STATISTICS
country_id
dt
collected
attribution_url
confirmed_cases
deaths
confirmed_cases_delta
deaths_delta
MX_DEMOGRAPHIC
geo_id
geolevel
population
attribution
attribution_url
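The statistics tables above carry a `*_delta` column next to each cumulative measure. Assuming a delta is the day-over-day change of the cumulative value (the deck does not define it), it can be derived as sketched below. The sample numbers are made up, and treating the first day's delta as the first cumulative value is a further assumption.

```python
from typing import List


def add_deltas(cumulative: List[int]) -> List[int]:
    """Day-over-day change of a cumulative series ordered by dt."""
    deltas = []
    previous = 0  # assumption: nothing was reported before the first day
    for value in cumulative:
        deltas.append(value - previous)
        previous = value
    return deltas


# Hypothetical confirmed_cases values for four consecutive days.
confirmed_cases = [10, 15, 15, 22]
confirmed_cases_delta = add_deltas(confirmed_cases)
print(confirmed_cases_delta)
```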
Backup
Cloud Pak for Data
as a Service
IBM Cloud – PaaS Context for Data Lake
IBM Cloud
Data Lake
Telemetry Data
IBM Cloud
Databases
Explore
ETL
Prepare
Enrich
Streaming Ingest
Optimize
Query
ETL
Infuse
Analyze
Organize
Collect
Cognos
Analytics
Watson
Machine
Learning
Watson OpenScale
Dashboarding AI
IBM SQL Query
IBM Analytics Engine
Cloud Object
Storage
IBM Event
Streams
IBM DataStage
Train
Ladder to AI
Ingest
Watson Knowledge Catalog
Watson
Studio
Data Science
Key Protect
Govern
Protect
Cloud Functions
Automate
IBM Cloud
Databases
Db2
Warehouse
LogDNA, Sysdig
Operate
Data
Virtualization
Serverless == Self-Service == Empowerment:
Data Producers becoming Data Product Owners
• Enable data producers to prepare and serve data for analytic consumption
• Data Lake building blocks are easy to use and easy to automate services
• Minimize data lake operations and resource cost overhead
• Objective:
  • Eliminate all hurdles (and excuses) for data producers NOT to serve their data for analytics
  • Reduce classical data engineers to role of data lake infrastructure providers
• Reference:
  • Paradigm Shift to Data Mesh: https://martinfowler.com/articles/data-monolith-to-mesh.html
IBM SQL Query – Timeseries SQL 1/2
• Intuitive first-of-a-kind SQL extensions for timeseries operations
• Industry leading differentiators, including:
  • Timeseries transformation functions: correlation, Fourier transformation, z-normalization, Granger, interpolation, and distances
  • Temporal joins: SQL support for Left/Right/Full Inner and Outer joins of multiple timeseries
Alignment & Joining:
IBM SQL Query – Timeseries SQL 2/2
• Further industry leading differentiators
  • Numerical and categorical timeseries types
  • Timeseries data skipping for fast queries
  • Forecasting: ARIMA, BATS, anomaly detection, etc.
  • Subsequence mining: train & match models for event sequences
  • Segmentation: time-based, record-based, anchor-based, burst, and silence
Segmentation:
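Time-based segmentation, the first segmentation type named above, can be pictured as grouping observations into fixed-width time windows. The following plain-Python sketch illustrates that concept only. It does not use or imitate the Timeseries SQL functions, and the function name and sample readings are hypothetical.

```python
from collections import defaultdict


def segment_by_time(observations, window_seconds):
    """Group (timestamp_seconds, value) pairs into fixed-width time windows."""
    segments = defaultdict(list)
    for ts, value in observations:
        # Each observation belongs to the window that starts at the
        # largest multiple of window_seconds not exceeding its timestamp.
        window_start = (ts // window_seconds) * window_seconds
        segments[window_start].append(value)
    return dict(sorted(segments.items()))


# Hypothetical sensor readings: (seconds since start, value).
readings = [(0, 1.0), (30, 1.5), (60, 2.0), (95, 2.5), (130, 3.0)]
print(segment_by_time(readings, 60))
```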
IBM SQL Query – Spatial SQL
• SQL/MM standard to store & analyze spatial data in RDBMS
• Migration of PostGIS compliant SQL queries
• Aggregation, computation and join via native SQL syntax
• Industry leading differentiators
  • Geodetic Full Earth support
  • Increased developer productivity
  • Avoid piece-wise planar projections
  • High precision calculations anywhere on the earth
  • Very large polygons (e.g. countries), polar caps, crossing the anti-meridian
  • Spatial data skipping for fast queries
  • Native and fine-granular geohash support
  • Fast spatial aggregation
SQL Query Scale Out Architecture
Data Center 2
Analytics Engine Cluster
20 Kernels
Node 1
Node 3
Node 2
Node 3
…
20
Kernels
…
Data Center 3
Analytics Engine Cluster
20 Kernels
Node 1
Node 3
Node 2
Node 3
…
20
Kernels
…
SQL 1 SQL 1
Data Center 1
Analytics Engine Cluster
20 Kernels
Cluster
Pool
Request Queue
Node 1
Node 3
Node 2
Node 3
…
Kernel
Pools
20
Kernels
…
SQL 1 SQL 2 SQL 3 SQL 4 SQL 5
Cloud Object Storage
SQL 6 …
JKG (Web Sockets)
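The diagram above shows a request queue feeding kernel pools of 20 kernels per Analytics Engine cluster across three data centers. The sketch below is one possible way to picture such dispatching: a simple round-robin over clusters that still have a free kernel. The scheduling policy, class and function names are all assumptions made for illustration; the deck does not describe the real dispatcher.

```python
from collections import deque
from itertools import cycle

KERNELS_PER_CLUSTER = 20  # kernel pool size shown on the slide


class Cluster:
    def __init__(self, data_center):
        self.data_center = data_center
        self.running = []  # jobs currently occupying a kernel

    def has_free_kernel(self):
        return len(self.running) < KERNELS_PER_CLUSTER


def dispatch(queue, clusters):
    """Assign queued jobs round-robin to clusters that have a free kernel."""
    placements = {}
    ring = cycle(clusters)
    while queue and any(c.has_free_kernel() for c in clusters):
        cluster = next(ring)
        if cluster.has_free_kernel():
            job = queue.popleft()
            cluster.running.append(job)
            placements[job] = cluster.data_center
    return placements  # jobs that found no kernel stay in the queue


clusters = [Cluster(f"Data Center {i}") for i in (1, 2, 3)]
requests = deque(f"SQL {i}" for i in range(1, 7))
placements = dispatch(requests, clusters)
print(placements)
```

Spreading jobs over clusters in different data centers is what lets a single cluster outage affect only part of the running workload.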
Secure Passing of Custom Data Source Credentials
IBM
Key Protect
User
Data Sources
Query
1. Create User/Password combination or API Key
2. Store password or API Key base64-encoded as custom key
3. Submit SQL statement referencing password or API Key via Key Protect CRN
4. Securely retrieve password or API Key
5. Connect with retrieved User/Password combination or API Key
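Step 2 asks for the secret to be base64-encoded before it is stored as a custom key. A minimal sketch of that encoding step with Python's standard library follows. The password is a placeholder, the helper names are hypothetical, and the actual upload to Key Protect is not shown.

```python
import base64


def encode_secret(secret: str) -> str:
    """Base64-encode a password or API key for storage as a custom key."""
    return base64.b64encode(secret.encode("utf-8")).decode("ascii")


def decode_secret(encoded: str) -> str:
    """Reverse of encode_secret, as the consuming side would do in step 4."""
    return base64.b64decode(encoded).decode("utf-8")


encoded = encode_secret("my-example-password")  # placeholder secret
print(encoded)
print(decode_secret(encoded))
```

Base64 is an encoding, not encryption. The protection in this flow comes from Key Protect holding the value, so the SQL statement only ever carries the key's CRN.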
Thank you
Torsten Steinbach, Data Lake Services Architect, IBM Cloud, IBM
Resources:
– SQL Query Documentation: https://cloud.ibm.com/docs/sql-query?topic=sql-query-overview
– SQL Query Tutorial: https://dataplatform.cloud.ibm.com/exchange/public/entry/view/4a9bb1c816fb1e0f31fec5d580e4e14d
– SQL Cloud Function: https://hub.docker.com/r/ibmfunctions/sqlquery/
– COVID-19 Data Lake Presentation: https://ibm.biz/Bdq5Ys
– THINK 2020 Cloud Data Lake Presentation: https://ibm.biz/Bdq5Yi
– IBM Cloud Data Lake Team: #wdp-sql-service & #sqlquery-support on Slack
– Blogs:
• https://www.ibm.com/cloud/blog/new-builders/data-lakes-in-the-cloud
• https://www.ibm.com/cloud/blog/big-data-layout
• https://www.ibm.com/cloud/blog/new-builders/big-data
• https://www.ibm.com/cloud/blog/sql-databases-and-object-storage
• https://www.ibm.com/cloud/blog/accelerate-your-big-data-analytics-and-reduce-costs-by-using-ibm-cloud-sql-query
• https://www.ibm.com/cloud/blog/a-serverless-attack-on-ugly-log-archives
• https://www.ibm.com/cloud/blog/announcements/automate-serverless-data-pipelines-for-your-data-warehouse-or-data-lakes

More Related Content

What's hot

What's hot (20)

Architecting a datalake
Architecting a datalakeArchitecting a datalake
Architecting a datalake
 
Global AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure DatabricksGlobal AI Bootcamp Madrid - Azure Databricks
Global AI Bootcamp Madrid - Azure Databricks
 
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu GantaAzure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
Azure Databricks – Customer Experiences and Lessons Denzil Ribeiro Madhu Ganta
 
McGraw-Hill Optimizes Analytics Workloads with Databricks
 McGraw-Hill Optimizes Analytics Workloads with Databricks McGraw-Hill Optimizes Analytics Workloads with Databricks
McGraw-Hill Optimizes Analytics Workloads with Databricks
 
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
Columbia Migrates from Legacy Data Warehouse to an Open Data Platform with De...
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Module 3 - QuickSight Overview
Module 3 - QuickSight OverviewModule 3 - QuickSight Overview
Module 3 - QuickSight Overview
 
Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...Running cost effective big data workloads with Azure Synapse and Azure Data L...
Running cost effective big data workloads with Azure Synapse and Azure Data L...
 
Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2Streaming Real-time Data to Azure Data Lake Storage Gen 2
Streaming Real-time Data to Azure Data Lake Storage Gen 2
 
Tarun poladi resume
Tarun poladi resumeTarun poladi resume
Tarun poladi resume
 
Building Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft AzureBuilding Modern Data Platform with Microsoft Azure
Building Modern Data Platform with Microsoft Azure
 
Ai & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientistAi & Data Analytics 2018 - Azure Databricks for data scientist
Ai & Data Analytics 2018 - Azure Databricks for data scientist
 
Modern data warehouse
Modern data warehouseModern data warehouse
Modern data warehouse
 
Dipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAsDipping Your Toes: Azure Data Lake for DBAs
Dipping Your Toes: Azure Data Lake for DBAs
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Spark Streaming with Azure Databricks
Spark Streaming with Azure DatabricksSpark Streaming with Azure Databricks
Spark Streaming with Azure Databricks
 
Azure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data LakeAzure Lowlands: An intro to Azure Data Lake
Azure Lowlands: An intro to Azure Data Lake
 
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
Develop scalable analytical solutions with Azure Data Factory & Azure SQL Dat...
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
Big Data and Data Warehousing Together with Azure Synapse Analytics (SQLBits ...
 

Similar to IBM Cloud Day January 2021 Data Lake Deep Dive

(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
Amazon Web Services
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Amazon Web Services
 

Similar to IBM Cloud Day January 2021 Data Lake Deep Dive (20)

Serverless SQL
Serverless SQLServerless SQL
Serverless SQL
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS(BDT317) Building A Data Lake On AWS
(BDT317) Building A Data Lake On AWS
 
Coud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AICoud-based Data Lake for Analytics and AI
Coud-based Data Lake for Analytics and AI
 
Create cloud service on AWS
Create cloud service on AWSCreate cloud service on AWS
Create cloud service on AWS
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS AWS March 2016 Webinar Series Building Your Data Lake on AWS
AWS March 2016 Webinar Series Building Your Data Lake on AWS
 
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch ServiceBDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log Analytics with Amazon Elasticsearch Service
 
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon RedshiftData warehousing in the era of Big Data: Deep Dive into Amazon Redshift
Data warehousing in the era of Big Data: Deep Dive into Amazon Redshift
 
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
Building Data Warehouses and Data Lakes in the Cloud - DevDay Austin 2017 Day 2
 
Best of re:Invent
Best of re:InventBest of re:Invent
Best of re:Invent
 
Data Analysis on AWS
Data Analysis on AWSData Analysis on AWS
Data Analysis on AWS
 
Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon MeichtryAWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
AWS Innovate: Build a Data Lake on AWS- Johnathon Meichtry
 
DBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data ApplicationsDBP-010_Using Azure Data Services for Modern Data Applications
DBP-010_Using Azure Data Services for Modern Data Applications
 
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924AWS Webcast - Managing Big Data in the AWS Cloud_20140924
AWS Webcast - Managing Big Data in the AWS Cloud_20140924
 

More from Torsten Steinbach

esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
Torsten Steinbach
 

More from Torsten Steinbach (11)

IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM CloudIBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
 
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?
 
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM CloudIBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
IBM THINK 2019 - Cloud-Native Clickstream Analysis in IBM Cloud
 
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL IBM THINK 2019 - Self-Service Cloud Data Management with SQL
IBM THINK 2019 - Self-Service Cloud Data Management with SQL
 
IBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query IntroductionIBM THINK 2018 - IBM Cloud SQL Query Introduction
IBM THINK 2018 - IBM Cloud SQL Query Introduction
 
IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
IBM Insight 2014 - Advanced Warehouse Analytics in the CloudIBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
 
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloudIBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
IBM Insight 2015 - 1823 - Geospatial analytics with dashDB in the cloud
 
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter AnalysisIBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
IBM Insight 2015 - 1824 - Using Bluemix and dashDB for Twitter Analysis
 
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
IBM InterConnect 2016 - 3505 - Cloud-Based Analytics of The Weather Company i...
 
IBM Information on Demand 2013 - Session 2839 - Using IBM PureData System fo...
IBM Information on Demand 2013  - Session 2839 - Using IBM PureData System fo...IBM Information on Demand 2013  - Session 2839 - Using IBM PureData System fo...
IBM Information on Demand 2013 - Session 2839 - Using IBM PureData System fo...
 
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
esri2015cloudantdashdbpresentation-150731203041-lva1-app6892
 

Recently uploaded

Recently uploaded (20)

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 

IBM Cloud Day January 2021 Data Lake Deep Dive

  • 1. IBM Cloud Day 2021 Deep Dive into Cloud Native Data Lakes with IBM Cloud Torsten Steinbach, IBM Cloud Data Lake Architect James Bennett, IBM Cloud Data Lake Offering Lead 21st Jan 2021
  • 2. Content • Architecture • Data Skipping • COVID-19 Data Lake • Interactive Demo
  • 3. IBM Cloud Data Lake Architecture
  • 4. Telemetry Data Explore ETL Prep Enrich Streaming Optimize Analyze ✓ Seamless Elasticity ✓ Seamless Scalability ✓ Highly Cost Effective ✓ Long Term Retention ✓ Any data formats ETL IBM Cloud Data Lake – Big Picture DWH Databases ✓ Response Time SLAs ✓ Warm High-quality Data only Cloud Data Lake Analytics Optional:
  • 5. IBM Serverless Stack for Analytics
      Serverless Storage (Object Storage): only pay for the volume of data that you actually store
      Serverless Runtimes (Cloud Functions): only pay for the CPU that you actually consume
      Serverless Analytics (Query): only pay for the amount of data that you actually scan
      Blog Article
      Properties of Serverless:
        – No management of resources, hosts and processes
        – Auto-scaling and auto-provisioning based on actual load
        – Precise billing based on actually consumed system resources (memory, storage, CPU, network, I/O)
        – High availability is always implicit
  • 6. IBM SQL Query – The Central Cloud Data Lake Service Cloud Data Data Transformation Serverless SQL Query Service Analytics Object Storage RDBMS + Developers Data Engineers Data Analysts ✓ Supports ad-hoc and unknown data structures ✓ ETL & ELT Support ✓ 100% Pay-as-you-go ($5/TB) ✓ 100% API enabled ✓ Automatic Big Data Scale-Out with Spark ✓ 100% Self service, No Setup Data Management + Data Scientists ✓ Built-In Database Catalog & Data Skipping Data Ingestion +
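The pay-as-you-go rate on slide 6 makes the cost of a query a simple function of the bytes it scans. A minimal sketch of that arithmetic follows. It assumes decimal terabytes (1 TB = 10^12 bytes) and no minimum charge per query; neither detail is stated on the slide, and the function name is hypothetical.

```python
PRICE_PER_TB_USD = 5.0    # rate quoted on slide 6
BYTES_PER_TB = 10 ** 12   # assumption: decimal terabytes


def scan_cost_usd(bytes_scanned: int) -> float:
    """Estimate the charge for a query that scans `bytes_scanned` bytes."""
    if bytes_scanned < 0:
        raise ValueError("bytes_scanned must be non-negative")
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Hypothetical workload: a 2 TB full scan versus the same query
# touching only 200 GB after irrelevant objects are skipped.
full_scan_cost = scan_cost_usd(2 * 10 ** 12)
pruned_scan_cost = scan_cost_usd(200 * 10 ** 9)
```

Because billing follows scanned volume, anything that reduces the bytes read (partitioning, columnar formats, data skipping) lowers cost as well as latency.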
  • 7. IBM SQL Query Architecture 2. Read data 4. Read results Application 3. Write data Cloud Data Services 1. Submit SQL SQL Event Streams Query Db2 on Cloud Geospatial SQL Data Skipping Timeseries SQL Hive Metastore Video Cloud Object Storage • Using IBM Analytics Engine service (Spark clusters as a service) • Large farm of Spark clusters auto-provisioned & auto-managed in background • Managing a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway) • SQL grammar sandbox • Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation • Intrinsically HA (dispatching across Spark environments in each availability zone)
  • 8. IBM SQL Query – Access Patterns Create Query SQL Console Watson Studio Notebooks Cloud Functions Integrate Explore Deploy Python SDK REST API JDBC Object Store Console Event Streams Console
  • 9. Meta Data IBM Cloud Data Lake – Meta Data Cloud Data ACID Spark Data Skipping Indexes Governance Policies & Lineage Schema, Partitioning, Statistics Serverless SQL Object Storage RDBMS Hive Metastore Kafka Schema Registry Xskipper Iceberg Watson Knowledge Catalog Deltalake
  • 10. Event Streams SQL Query Object Storage Meta Data Integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg) Real-Time Queries IBM Cloud Data Lake – 2021 Architecture COS Batch Queries Stream Xform & Joins Stream data landing Schema management & enforcement ETL & Data Preparation
  • 11. Combining Spatial and Temporal Processing IBM Cloud Object Storage Sensor Data Query Location Analytics Mobile Cars Devices Land Location Filtering Spatial Aggregation GPS SQL/MM Sensor Metrics t t t Timeseries Assembly Timeseries Join Timeseries SQL t
  • 12. The SQL Sandwich Object Storage Object Storage Data Warehouse Raw Data High Quality Data Archived Data SQL ETL SQL ETL SQL Federation Explore, Prepare & Batch Analytics Interactive Analytics with SLAs Compliance Reporting SQL SQL SQL Blog Article: SQL Sandwich
  • 13. Promoting Data After Preparation
      Syntax:
        SELECT … INTO <COS URI> <format & layout ops> | <Db2 service CRN> | <Db2 database URI> /<table name> [CREATE | OVERWRITE | APPEND] [PARALLELISM <num>]
      Promote on COS:
        COS URI: e.g. cos://us-south/myBucket/myFolder/myData.parquet
        COS Format/Layout: e.g. STORED AS PARQUET PARTITIONED BY (city, date)
      Promote to Db2:
        Db2 options: PARALLELISM: Number of parallel threads for writing (default 1)
        Examples:
          … INTO db2://db2w-dja.us-south.db2w.cloud.ibm.com/MYSCHEMA.MYTABLE PARALLELISM 20
          … INTO crn:v1:bluemix:public:dashdb-for-tx:us-south:s/c38…:cf-service-instance:/MYTABLE
      * future
      Blog Article: Db2 ETL
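The INTO clause variants on slide 13 can be assembled programmatically when pipelines generate their SQL. The sketch below only builds the clause text shown on the slide; `into_cos` and `into_db2` are hypothetical helper names, not part of any IBM SDK, and nothing here submits a query.

```python
from typing import Optional, Sequence


def into_cos(uri: str, fmt: str = "PARQUET",
             partitioned_by: Sequence[str] = ()) -> str:
    """Build an INTO clause that writes the result to Cloud Object Storage."""
    clause = f"INTO {uri} STORED AS {fmt}"
    if partitioned_by:
        clause += f" PARTITIONED BY ({', '.join(partitioned_by)})"
    return clause


def into_db2(target: str, parallelism: Optional[int] = None) -> str:
    """Build an INTO clause for a Db2 target (database URI or service CRN)."""
    clause = f"INTO {target}"
    if parallelism is not None:
        if parallelism < 1:
            raise ValueError("parallelism must be at least 1")
        clause += f" PARALLELISM {parallelism}"
    return clause


cos_clause = into_cos(
    "cos://us-south/myBucket/myFolder/myData.parquet",
    partitioned_by=["city", "date"],
)
db2_clause = into_db2(
    "db2://db2w-dja.us-south.db2w.cloud.ibm.com/MYSCHEMA.MYTABLE",
    parallelism=20,
)
```

Omitting `parallelism` leaves the clause without a PARALLELISM option, so the documented default of 1 writer thread applies.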
  • 15. What supported formats are analytics friendly? Blog Article: Data Layout
  • 16. Data Skipping in IBM SQL Query • Avoid reading irrelevant objects using indexes • Complements partition pruning -> object level pruning • Stores aggregate metadata per object to enable skipping decisions • Indexes are stored in COS • Supports multiple index types • Currently MinMax, ValueList, BloomFilter, Geospatial • Underlying data skipping library is extensible • New index types can easily be supported • Enables data skipping on SQL UDFs • e.g. ST_Contains, ST_Distance etc. • UDFs are mapped to indexes
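Of the index types named on slide 16, ValueList is the easiest to picture: per object, keep the set of distinct values of a column, and skip every object whose set does not contain the value a query asks for. The following is an illustrative sketch of that idea only, not the service's or the underlying library's implementation; the object names, the `city` column, and both function names are made up.

```python
from typing import Dict, Iterable, List, Set


def build_value_list_index(
    column_values_by_object: Dict[str, Iterable[str]],
) -> Dict[str, Set[str]]:
    """Store the distinct values of one column for each object."""
    return {name: set(values) for name, values in column_values_by_object.items()}


def objects_to_read(index: Dict[str, Set[str]], wanted: str) -> List[str]:
    """Keep only objects that can contain rows where column == wanted."""
    return sorted(name for name, values in index.items() if wanted in values)


# Hypothetical objects with the values of a `city` column found in each.
index = build_value_list_index({
    "part-00001": ["Raleigh", "Durham", "Raleigh"],
    "part-00002": ["Boston"],
    "part-00003": ["Durham", "Cary"],
})
durham_objects = objects_to_read(index, "Durham")
```

A value list is exact but grows with column cardinality; a Bloom filter trades occasional false positives (an object read unnecessarily) for a fixed-size summary.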
  • 17. How Data Skipping Works Spark SQL Query Execution Flow Uses Catalyst optimizer and session extensions API Query Prune partitions Read data Query Prune partitions Optional file filter Read data Metadata Filter
  • 18. Data Skipping Example
      Data object listing:
        Weather/dt=2020-08-17/part-00085.parquet
        Weather/dt=2020-08-17/part-00086.parquet
        Weather/dt=2020-08-17/part-00087.parquet
        Weather/dt=2020-08-17/part-00088.parquet
        Weather/dt=2020-08-18/part-00001.parquet
        Weather/dt=2020-08-18/part-00002.parquet
      Example query:
        SELECT * FROM cos://us-geo/twc/Weather STORED AS parquet WHERE temp > 40
      Metadata:
        Object Name                 Temp Min   Temp Max
        ...
        dt=2020-08-17/part-00085    7.97       26.77
        dt=2020-08-17/part-00086    2.45       23.71
        dt=2020-08-17/part-00087    6.46       18.62
        dt=2020-08-17/part-00088    23.67      41.02
        ...
      Red objects are not relevant to this query (of the rows shown, only part-00088, with Temp Max 41.02, can match temp > 40)
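The skipping decision for the example on slide 18 can be reproduced with a few lines of Python. This is a sketch of the MinMax idea under the assumption that the index holds one (min, max) pair per object; the values are copied from the slide's metadata table, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

# (Temp Min, Temp Max) per object, as listed in the slide's metadata table.
TEMP_MIN_MAX: Dict[str, Tuple[float, float]] = {
    "dt=2020-08-17/part-00085": (7.97, 26.77),
    "dt=2020-08-17/part-00086": (2.45, 23.71),
    "dt=2020-08-17/part-00087": (6.46, 18.62),
    "dt=2020-08-17/part-00088": (23.67, 41.02),
}


def candidates_greater_than(index: Dict[str, Tuple[float, float]],
                            threshold: float) -> List[str]:
    """Objects that may hold a row with value > threshold.

    An object can be skipped when even its maximum does not exceed
    the threshold, because then no row in it can satisfy the predicate.
    """
    return [name for name, (_low, high) in index.items() if high > threshold]


# Predicate from the example query: WHERE temp > 40
to_read = candidates_greater_than(TEMP_MIN_MAX, 40)
```

The check is conservative: an object that passes may still contain no matching row, but an object that fails can never contain one, so query results are unchanged.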
  • 19. Geospatial Data Skipping Example
      Example query (Raleigh Research Triangle, US):
        SELECT * FROM Weather STORED AS parquet WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'), ST_Point(long, lat)) INTO cos://us-south/results STORED AS parquet
      Metadata:
        Object Name                 lat Min    lat Max
        ...
        dt=2020-08-17/part-00085    35.02      36.17
        dt=2020-08-17/part-00086    43.59      44.95
        dt=2020-08-17/part-00087    34.86      40.62
        dt=2020-08-17/part-00088    23.67      25.92
        ...
      Red objects are not relevant to this query (of the rows shown, part-00086 and part-00088 lie entirely outside the polygon's latitude range of 35.78 to 36.00)
      Map the ST_Contains UDF to necessary conditions on lat, long
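Slide 19 says the ST_Contains UDF is mapped to necessary conditions on lat and long. One plausible reading, sketched here as an assumption rather than the service's actual rewrite, is a bounding-box test: a point inside the polygon must have a latitude within the polygon's latitude range, so any object whose lat range does not overlap that range can be skipped. Only latitude metadata appears on the slide; longitude would be handled the same way.

```python
from typing import Dict, List, Sequence, Tuple

# Polygon vertices from the example query, as (long, lat) pairs.
POLYGON: List[Tuple[float, float]] = [
    (-78.93, 36.00), (-78.67, 35.78), (-79.04, 35.90), (-78.93, 36.00),
]

# (lat Min, lat Max) per object, as listed in the slide's metadata table.
LAT_MIN_MAX: Dict[str, Tuple[float, float]] = {
    "dt=2020-08-17/part-00085": (35.02, 36.17),
    "dt=2020-08-17/part-00086": (43.59, 44.95),
    "dt=2020-08-17/part-00087": (34.86, 40.62),
    "dt=2020-08-17/part-00088": (23.67, 25.92),
}


def latitude_range(polygon: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Smallest and largest latitude among the polygon's vertices."""
    lats = [lat for _lon, lat in polygon]
    return min(lats), max(lats)


def candidates_for_polygon(index: Dict[str, Tuple[float, float]],
                           polygon: Sequence[Tuple[float, float]]) -> List[str]:
    """Objects whose latitude interval overlaps the polygon's latitude interval."""
    query_low, query_high = latitude_range(polygon)
    return [name for name, (low, high) in index.items()
            if low <= query_high and high >= query_low]


to_read = candidates_for_polygon(LAT_MIN_MAX, POLYGON)
```

The surviving objects still go through the exact ST_Contains evaluation; the interval test only removes objects that cannot contribute.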
  • 20. 10x Acceleration with Data Skipping and Catalog Query rewrite approach (yellow) is the baseline • Using already optimized data format: Parquet/ORC For other formats the acceleration is much larger • e.g. CSV/JSON/Avro Experiment uses Raleigh Research Triangle query 10x speedup on average
  • 22. An Idealized Enterprise Data Lake Topology Systems of Record Systems of Record Streaming Topics Streaming Topics LoB Data Lake Projects LoB Data Lake Projects LoB Data Lake Projects Landing Zone Landing Subscription Prep Zone Enterprise Zone LoB Prep Zone Archive Zone EDW Systems of Record LoB DBs Streaming Topics Scheduled ETL or CDC LoB Analytic Zone Publish Analytic Apps Analytic Apps Analytic Apps Analytic Apps Automatic On Demand Read only IBM Confidential
  • 23. Making trusted COVID-19 data available to a broad set of analytics, e.g.: ▪ https://accelerator.weather.com/bi ▪ Watson Health Return to Work Advisor The COVID-19 Data Lake ➢ Extensible with new data sources easily ➢ Maximized velocity and elasticity ➢ Full automation of all pipelines ➢ New pipeline prototype in hours & productize in 2-3 days ➢ Radically minimizing resource and operational costs by using IBM Cloud serverless and full ops automation Cloud Functions Cloud Object Storage - Persist - Trigger - Static Content Creation - Schema Management - Pipeline PoCs - Usage Tutorials Watson Studio SQL Query - Transformation - Transport - Table Catalog (Mart) - Queries - Export - Pipeline -Productization - Automation - Monitoring & Alerting - Pull External Data
  • 24. The Four P of Serverless Pipelines Data Operations COS Object Operations Prototype ibm_boto3 Python COS ibmcloudsql Create Notebooks PoC Schedule Notebooks Deploy Cloud Functions Productize Watson Studio Notebooks Watson Studio Notebooks
  • 25. COVID-19 Data Lake Topology – High Level Landing Zone (E) Landing Buckets Preparation Zone (T) Landing Namespace Preparation Namespace Preparation Buckets Integration Zone (L) Dashboarding DWH Integration Buckets Data Mart Instance Integration Namespace Mart Management Project Data Mart Access Project TWC Scrapers & Pipeline Collectors Sequences Preparation Sequences Mart Sequences Delivery Sequences Pipeline Instance Schema Management Static Content Management Pipeline Instance Usage Notebooks Table Catalog Preparation Sequences External Data Sources Pull Push Collectors Sequences Preparation Sequences Usage Notebooks Usage Notebooks Users Pipeline PoC Project Preliminary Pipeline Notebooks Location Statistics Upload Update Reference Data Add Partitions Query & Extract Transform COGNOS
  • 26. COVID-19 Data Mart – Data Model
      Tables (grouped on the slide into Fact Tables and Dimension Tables):
        US_DEMOGRAPHIC: geo_id, geolevel, total_population, male, ..., 35_to_40_years, ..., attribution, attribution_url
        EU_DEMOGRAPHIC: geo_id, geolevel, sex_total, sex_y15_29, ..., sex_f_age_total, ..., attribution, attribution_url
        WORLD_DEMOGRAPHIC: geo_id, geolevel, population, migrants_net, ..., attribution, attribution_url
        MX_DEMOGRAPHIC: geo_id, geolevel, population, attribution, attribution_url
        COUNTY_STATISTICS: county_id, dt, collected, attribution_url, confirmed_cases (& *_delta), deaths (& *_delta), hospitalized (& *_delta), testsperformed (& *_delta), recovered (& *_delta)
        PROVINCE_STATISTICS: province_id, dt, collected, attribution_url, confirmed_cases (& *_delta), deaths (& *_delta), hospitalized (& *_delta), testsperformed (& *_delta), recovered (& *_delta)
        COUNTRY_STATISTICS: country_id, dt, collected, attribution_url, confirmed_cases (& *_delta), deaths (& *_delta), hospitalized (& *_delta), testsperformed (& *_delta), recovered (& *_delta)
        ECDC_STATISTICS: country_id, dt, collected, attribution_url, confirmed_cases, deaths, confirmed_cases_delta, deaths_delta
        WHO_STATISTICS: country_id, dt, collected, attribution_url, confirmed_cases, deaths, confirmed_cases_delta, deaths_delta
        COUNTIES: county_id, county_name, country_id, province_id, code_type, fips_code, nuts_code
        PROVINCES: province_id, province_name, country_id, code_type, fips_code, nuts_code
        COUNTRIES: country_id, country_name, code_type
        GEOGRAPHIC_FULL: geo_id, geolevel, region, lat, lon, geometry_wkt, attribution, attribution_url
      Views:
        WORLD_GEOGRAPHIC (view): country_id=“WORLD”
        EU_GEOGRAPHIC (view): country_id=“EU”
        US_GEOGRAPHIC (view): country_id=“US”
        US_COUNTIES (view): country_id=“US”
        GEOGRAPHIC (view): substr(geometry_wkt, 1, 30)
        WORLD_GEOGRAPHIC_FULL (view): country_id=“WORLD”
        EU_GEOGRAPHIC_FULL (view): country_id=“EU”
        US_GEOGRAPHIC_FULL (view): country_id=“US”
        US_PROVINCES (view): country_id=“US”
  • 28. Cloud Pak for Data as a Service IBM Cloud – PaaS Context for Data Lake IBM Cloud Data Lake Telemetry Data IBM Cloud Databases Explore ETL Prepare Enrich Streaming Ingest Optimize Query ETL Infuse Analyze Organize Collect Cognos Analytics Watson Machine Learning Watson Open Scale Dashboarding AI IBM SQL Query IBM Analytic Engine Cloud Object Storage IBM Event Streams IBM Data Stage Train Ladder to AI Ingest Watson Knowledge Catalog Watson Studio Data Science Key Protect Govern Protect Cloud Functions Automate IBM Cloud Databases Db2 Warehouse logDNA, sysdig Operate Data Virtualization
  • 29. Serverless == Self-Service == Empowerment: Data Producers becoming Data Product Owners ▪ Enable data producers to prepare and serve data for analytic consumption ▪ Data Lake building blocks are services that are easy to use and easy to automate ▪ Minimize data lake operations and resource cost overhead ▪ Objective: ▪ Eliminate all hurdles (and excuses) for data producers NOT to serve their data for analytics ▪ Reduce classical data engineers to the role of data lake infrastructure providers ▪ Reference: ▪ Paradigm Shift to Data Mesh: https://martinfowler.com/articles/data-monolith-to-mesh.html
  • 30. IBM SQL Query – Timeseries SQL 1/2 ▪ Intuitive first-of-a-kind SQL extensions for timeseries operations ▪ Industry leading differentiators, including: • Timeseries transformation functions: • Correlation, Fourier transformation, z-normalization, Granger, interpolation, and distances • Temporal Joins: SQL support for Left/Right/Full Inner and Outer joins of multiple timeseries Alignment & Joining:
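Among the transformation functions listed on slide 30, z-normalization is the simplest to illustrate: each observation is shifted by the series mean and divided by the standard deviation, which makes series on different scales comparable before correlation or distance computations. The sketch below shows the textbook definition in plain Python; it uses the population standard deviation, which is an assumption, and it does not claim to match the exact semantics of the service's SQL function.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def z_normalize(values: Sequence[float]) -> List[float]:
    """Rescale a series to mean 0 and (population) standard deviation 1."""
    if not values:
        return []
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant series has no spread; map every point to 0.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


# Hypothetical sensor readings.
normalized = z_normalize([18.0, 21.0, 24.0, 21.0])
```

After normalization, two sensors that report the same pattern in different units produce the same series.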
  • 31. IBM SQL Query – Timeseries SQL 2/2 ▪ Further Industry leading differentiators • Numerical and categorical timeseries types • Timeseries data skipping for fast queries • Forecasting: • ARIMA, BATS, Anomaly detection, etc. • Subsequence Mining: • Train & match models for event sequences • Segmentation: • Time-based, Record-based, Anchor-based, Burst, and silence Segmentation:
  • 32. IBM SQL Query – Spatial SQL ▪ SQL/MM standard to store & analyze spatial data in RDBMS ▪ Migration of PostGIS compliant SQL queries ▪ Aggregation, computation and join via native SQL syntax ▪ Industry leading differentiators • Geodetic Full Earth support • Increased developer productivity • Avoid piece-wise planar projections • High precision calculations anywhere on the earth • Very large polygons (e.g. countries), polar caps, crossing the anti-meridian • Spatial data skipping for fast queries • Native and fine-granular geohash support • Fast spatial aggregation
  • 33. SQL Query Scale Out Architecture Data Center 2 Analytics Engine Cluster 20 Kernels Node 1 Node 3 Node 2 Node 3 … 20 Kernels … Data Center 3 Analytics Engine Cluster 20 Kernels Node 1 Node 3 Node 2 Node 3 … 20 Kernels … SQL 1 SQL 1 Data Center 1 Analytics Engine Cluster 20 Kernels Cluster Pool Request Queue Node 1 Node 3 Node 2 Node 3 … Kernel Pools 20 Kernels … SQL 1 SQL 2 SQL 3 SQL 4 SQL 5 Cloud Object Storage SQL 6 … JKG (Web Sockets)
  • 34. Secure Passing of Custom Data Source Credentials IBM Key Protect User Data Sources Query 1. Create User/Password combination or API Key 2. Store password or API Key base64-encoded as custom key 3. Submit SQL statement referencing password or API Key via key protect CRN 4. Securely retrieve password or API Key 5. Connect with retrieved User/Password combination or API Key
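Step 2 on slide 34 stores the password or API key base64-encoded. A minimal sketch of that encoding step with Python's standard library follows; the secret value and function names are placeholders, and the calls that actually store the key in Key Protect are deliberately left out because they are not described on the slide.

```python
import base64


def encode_secret(secret: str) -> str:
    """Base64-encode a password or API key for storage as a custom key."""
    return base64.b64encode(secret.encode("utf-8")).decode("ascii")


def decode_secret(encoded: str) -> str:
    """Reverse of encode_secret, as the consuming side would do in step 4."""
    return base64.b64decode(encoded).decode("utf-8")


# Placeholder value; a real secret should never be hard-coded.
encoded_key = encode_secret("s3cret")
round_trip = decode_secret(encoded_key)
```

Base64 is only a transport encoding and provides no confidentiality by itself; in the flow on the slide, protection comes from keeping the encoded value in Key Protect and referencing it by CRN instead of placing it in the SQL text.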
  • 35. Thank you
      Torsten Steinbach, Data Lake Services Architect, IBM Cloud, IBM
      Resources:
        – SQL Query Documentation: https://cloud.ibm.com/docs/sql-query?topic=sql-query-overview
        – SQL Query Tutorial: https://dataplatform.cloud.ibm.com/exchange/public/entry/view/4a9bb1c816fb1e0f31fec5d580e4e14d
        – SQL Cloud Function: https://hub.docker.com/r/ibmfunctions/sqlquery/
        – COVID-19 Data Lake Presentation: https://ibm.biz/Bdq5Ys
        – THINK 2020 Cloud Data Lake Presentation: https://ibm.biz/Bdq5Yi
        – IBM Cloud Data Lake Team: #wdp-sql-service & #sqlquery-support on Slack
        – Blogs:
          • https://www.ibm.com/cloud/blog/new-builders/data-lakes-in-the-cloud
          • https://www.ibm.com/cloud/blog/big-data-layout
          • https://www.ibm.com/cloud/blog/new-builders/big-data
          • https://www.ibm.com/cloud/blog/sql-databases-and-object-storage
          • https://www.ibm.com/cloud/blog/accelerate-your-big-data-analytics-and-reduce-costs-by-using-ibm-cloud-sql-query
          • https://www.ibm.com/cloud/blog/a-serverless-attack-on-ugly-log-archives
          • https://www.ibm.com/cloud/blog/announcements/automate-serverless-data-pipelines-for-your-data-warehouse-or-data-lakes