1. Serverless Cloud Data Lake with Spark for Serving Weather Data
Torsten Steinbach, IBM, Cloud Data Lake Architect
Paula Ta-Shma, IBM, Research Staff Member
27 Jan 2021
2. Agenda
• Cloud Data Lake Architecture with IBM Cloud®
• Serverless Spark
  • Ingest + Prep + Analyze + Manage
  • Real-time vs. Batch
  • Metadata
• Serving Weather History on Demand
  • Use Case
  • Architecture
  • Performance
• Announcing Xskipper Open Source
4. Cloud Data Lake – Big Picture
[Diagram: telemetry data flows into the Cloud Data Lake for exploration, streaming, ETL, preparation, enrichment, optimization and analytics; an optional ETL path feeds a DWH and databases for analytics.]
Cloud Data Lake:
✓ Seamless elasticity
✓ Seamless scalability
✓ Highly cost effective
✓ Long-term retention
✓ Any data format
Optional DWH / databases:
✓ Response-time SLAs
✓ Warm, high-quality data only
5. Serverless Stack for Analytics
• Serverless Storage – e.g. Object Storage: only pay for the volume of data that you really store
• Serverless Runtimes – e.g. Cloud Functions, Code Engine: only pay for the CPU that you really consume
• Serverless Analytics – e.g. SQL Query: only pay for the amount of data that you really scan
§ Properties of Serverless:
– No management of resources, hosts and processes
– Auto-scaling and auto-provisioning based on actual load
– Precise billing based on actually consumed system resources (memory, storage, CPU, network, I/O)
– High availability is always implicit
6. SQL Query Service Architecture
[Diagram: an application 1. submits SQL to the Query service, which 2. reads data from Cloud Object Storage, Event Streams, Db2 on Cloud and other Cloud Data Services, 3. writes result data to Cloud Object Storage, and the application 4. reads the results. The engine integrates Geospatial SQL, Timeseries SQL, Data Skipping and a Hive Metastore.]
• Uses the IBM Analytics Engine service (Spark clusters aaS)
• Large farm of Spark clusters auto-provisioned & auto-managed in the background
• Manages a hot pool of Spark applications (a.k.a. kernels, using Jupyter Kernel Gateway)
• SQL grammar sandbox
• Auto-scaling of each serverless SQL job inside large Spark clusters using dynamic resource allocation
• Intrinsically HA (dispatching across Spark environments in each availability zone)
7. SQL Query Service – Capabilities
[Diagram: serverless SQL analytics over Object Storage and RDBMS cloud data, covering data ingestion, data transformation and data management, serving developers, data engineers, data analysts and data scientists.]
✓ Supports ad-hoc and unknown data structures
✓ ETL & ELT support
✓ 100% pay-as-you-go ($5/TB)
✓ 100% API enabled
✓ Automatic big data scale-out with Spark
✓ 100% self service, no setup
✓ Built-in database catalog & data skipping
8. Cloud Data Lake – 2021 Architecture in IBM Cloud
[Diagram: Event Streams lands stream data in Cloud Object Storage (COS) and serves real-time queries; SQL Query performs stream transformations & joins, ETL & data preparation, and batch queries over COS; schema management & enforcement is provided by an integrated Hive Metastore + Kafka Schema Registry + ACID (Iceberg).]
9. Metadata in Cloud Data Lake – En Route to Lakehouse
[Diagram: serverless SQL and Spark over Object Storage and RDBMS cloud data, with metadata layers: schema, partitioning and statistics (Hive Metastore); schema registry (Kafka Schema Registry); ACID (Iceberg, Delta Lake); data skipping indexes (Xskipper); governance policies & lineage (Watson Knowledge Catalog).]
12. [The Weather Company at scale]
• Maps the atmosphere every 15 minutes
• Processes over 400 terabytes of data daily
• Delivers more than 50 billion requests for weather information every day and produces 25 billion forecasts daily
Source: Qliksense internal report, April 2017; according to internal forecasting system + number of locations in the world by lat/lon (2 decimal places); 400 terabytes according to internal SUN platform numbers
And has evolved into
13. TWC History on Demand (HoD)
Provides access to a worldwide, hourly, high-resolution, gridded dataset of past weather conditions via a web API
Conditions
• Global 4 km grid (0.044-degree resolution)
• 34 potential weather properties
• 34 million records added every hour
Geospatial and temporal search
• Point, bounding box, and polygon search over a time range
Usage
• Averages 600,000 requests per day
• Used by clients primarily for machine learning and data analytics
• Supports research in domains such as climate science, energy & utilities, agriculture, transportation, insurance, and retail
14. Problems with the previous implementation
• Expensive
  • Our synchronous data access solution is expensive
• Limited storage capacity
  • Hard storage limits per cluster with our previous cloud provider and storage solution
• Reduced data provided
  • To lower cost and stay below the storage limit, we reduced our data to land-only coverage and 20 of the 34 available weather properties
• Clients are limited to small requests
  • To allow for a synchronous interaction, clients are required to limit the scope of their requests to 2,400 records
• Slow at retrieving large amounts of data
  • Because of the small query sizes, it is time-consuming to retrieve large amounts of data
15. New Solution Overview
Serverless approach
• Pay per use -> Low cost
IBM Cloud SQL Query
• Serverless SQL powered by Spark
• Hive Metastore
• Geospatial
• Data Skipping
IBM Cloud Object Storage (COS)
• S3 Compatible API
Apply Best Practices
• Parquet
• Geospatial Data Layout
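The "geospatial data layout" best practice can be illustrated with a small plain-Python sketch (the coordinates and the crude one-degree grid key are hypothetical, not the production layout): sorting rows by a spatial key before writing clusters nearby points into the same object, so each object covers a narrow lat range and min/max metadata becomes far more selective.

```python
def spatial_key(lat, lon, cell=1.0):
    # Crude spatial key: row-major order over a one-degree grid
    return (int(lat // cell), int(lon // cell))

def chunk_ranges(rows, chunk_size):
    """Split rows into consecutive 'objects' and report each object's latitude span."""
    spans = []
    for i in range(0, len(rows), chunk_size):
        chunk = rows[i:i + chunk_size]
        lats = [lat for lat, _lon in chunk]
        spans.append(max(lats) - min(lats))
    return spans

# Hypothetical (lat, long) observations arriving in no particular order
rows = [(35.8, -78.9), (44.1, -79.2), (25.3, -78.5),
        (36.0, -79.0), (44.6, -78.8), (25.9, -79.1)]

# Sorting by the spatial key clusters nearby points into the same object,
# so each object's min/max lat range becomes narrow and easy to skip
sorted_rows = sorted(rows, key=lambda r: spatial_key(*r))
```

With this layout, objects written from `sorted_rows` have much tighter latitude spans than objects written in arrival order, which is exactly what geospatial data skipping needs.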
16. Data Skipping in IBM SQL Query
• Avoid reading irrelevant objects using indexes
• Complements partition pruning -> object level pruning
• Stores aggregate metadata per object to enable skipping decisions
• Indexes are stored in COS
• Supports multiple index types
• Currently MinMax, ValueList, BloomFilter, Geospatial
• Underlying data skipping library is extensible
• New index types can easily be supported
• Enables data skipping on SQL UDFs
• e.g. ST_Contains, ST_Distance etc.
• UDFs are mapped to indexes
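The skipping decision itself can be illustrated with a plain-Python sketch (the object names and min/max values below are hypothetical; this is not the SQL Query internals): an object is skipped only when its per-column min/max metadata proves that no row inside it can satisfy the predicate.

```python
def can_skip_minmax(obj_min, obj_max, op, value):
    """Return True if the predicate `col <op> value` can match no row
    in an object whose column values lie in [obj_min, obj_max]."""
    if op == ">":
        return obj_max <= value      # all values too small
    if op == "<":
        return obj_min >= value      # all values too large
    if op == "=":
        return value < obj_min or value > obj_max
    return False                     # unknown operator: must read the object

# Hypothetical metadata for four Parquet objects: name -> (min, max) of `temp`
metadata = {
    "part-00085": (12.0, 19.5),
    "part-00086": (18.0, 31.0),
    "part-00087": (30.5, 41.2),
    "part-00088": (-5.0, 10.0),
}

# Predicate temp > 30: only objects that may contain matches are read
to_read = [name for name, (lo, hi) in metadata.items()
           if not can_skip_minmax(lo, hi, ">", 30.0)]
```

Note that skipping is conservative: an object that survives the metadata check may still contain no matching rows, so surviving objects are read and filtered as usual.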
17. How Data Skipping Works
Spark SQL Query Execution Flow (uses the Catalyst optimizer and session extensions API)
• Without data skipping: Query → Prune partitions → Read data
• With data skipping: Query → Prune partitions → Optional file filter (consults the metadata) → Read data
19. Geospatial Data Skipping Example
Example query (Raleigh Research Triangle, US):
SELECT * FROM Weather STORED AS parquet
WHERE ST_Contains(ST_WKTToSQL('POLYGON((-78.93 36.00, -78.67 35.78, -79.04 35.90, -78.93 36.00))'),
                  ST_Point(long, lat))
INTO cos://us-south/results STORED AS parquet

Metadata (excerpt):
Object Name                 lat Min   lat Max   ...
dt=2020-08-17/part-00085    35.02     36.17
dt=2020-08-17/part-00086    43.59     44.95
dt=2020-08-17/part-00087    34.86     40.62
dt=2020-08-17/part-00088    23.67     25.92

The ST_Contains UDF is mapped to necessary conditions on lat and long; objects whose ranges cannot satisfy them (shown red in the diagram) are not relevant to this query and are not read.
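That mapping can be sketched in plain Python (the per-object long ranges are hypothetical additions for illustration; the real mapping is done inside the data skipping library): for ST_Contains(polygon, point(long, lat)) to hold for any row, the object's lat/long ranges must intersect the polygon's bounding box. This is a necessary but not sufficient condition, so surviving objects are still filtered row by row.

```python
def bounding_box(polygon):
    """Axis-aligned bounding box of a polygon given as (long, lat) pairs."""
    longs = [p[0] for p in polygon]
    lats = [p[1] for p in polygon]
    return min(longs), max(longs), min(lats), max(lats)

def may_contain(meta, polygon):
    """Necessary condition: the object's lat/long ranges overlap the bbox."""
    lon_min, lon_max, lat_min, lat_max = bounding_box(polygon)
    return not (meta["lat_max"] < lat_min or meta["lat_min"] > lat_max
                or meta["long_max"] < lon_min or meta["long_min"] > lon_max)

# Polygon from the example query (Raleigh Research Triangle area)
triangle = [(-78.93, 36.00), (-78.67, 35.78), (-79.04, 35.90), (-78.93, 36.00)]

# Hypothetical per-object metadata; only overlapping objects are read
objects = {
    "part-00085": {"lat_min": 35.02, "lat_max": 36.17, "long_min": -80.0, "long_max": -77.0},
    "part-00086": {"lat_min": 43.59, "lat_max": 44.95, "long_min": -80.0, "long_max": -77.0},
    "part-00088": {"lat_min": 23.67, "lat_max": 25.92, "long_min": -80.0, "long_max": -77.0},
}
to_read = [name for name, meta in objects.items() if may_contain(meta, triangle)]
```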
20. 10x Acceleration with Data Skipping and Catalog
• Query rewrite approach (yellow in the chart) is the baseline
• Using an already optimized data format: Parquet/ORC
• For other formats the acceleration is much larger, e.g. CSV/JSON/Avro
• Experiment uses the Raleigh Research Triangle query
• 10x speedup on average
21. TWC HoD's new asynchronous solution
• More cost-effective
  • Our use of IBM Cloud SQL Query and Cloud Object Storage has resulted in an order of magnitude reduction in cost
• Unlimited storage
  • With Cloud Object Storage we effectively have unlimited storage capacity
• Global weather data coverage with all 34 weather properties
  • With the reduced cost and unlimited storage we no longer have to limit the data we provide
• Support for large requests
  • With an asynchronous interaction, clients can now submit a single request for everything they're interested in
• Large amounts of data retrieved quickly with a single query
  • Because IBM Cloud SQL Query uses Spark behind the scenes, large queries complete relatively quickly
  • 40x speedup in some cases
23. Xskipper – Extensible Data Skipping – Released to Open Source!
• Works with Apache Spark today
• Supports many open formats: Parquet, CSV, JSON, Avro, ORC
• Supports MinMax, ValueList and BloomFilter out of the box
• Supports Hive tables
• Define your own data skipping index types
  • Use novel data structures
  • Apply to novel use cases
  • Can be domain-specific, e.g. for geospatial, genomic, astronomical data etc.
• Enable skipping for your UDFs by mapping them to conditions over your data skipping indexes
• Can achieve order of magnitude performance acceleration and beyond
  • Even for formats with built-in data skipping
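The idea behind an extensible index type can be sketched in plain Python (hypothetical station codes; this is not the Xskipper API): a ValueList index stores the distinct values of a low-cardinality column per object, so an equality predicate can skip every object whose value list does not contain the literal.

```python
def build_value_list_index(objects):
    """objects: name -> list of column values; index: name -> set of distinct values."""
    return {name: set(values) for name, values in objects.items()}

def objects_to_read(index, value):
    """For predicate `col = value`, keep only objects whose value list
    contains the literal; all others are skipped."""
    return [name for name, values in index.items() if value in values]

# Hypothetical objects holding a low-cardinality `station` column
objects = {
    "part-0000": ["KRDU", "KGSO"],
    "part-0001": ["KJFK", "KLGA"],
    "part-0002": ["KRDU", "KCLT"],
}
index = build_value_list_index(objects)
```

A custom index type plugged into an extensible library follows the same pattern: a build step that aggregates per-object metadata, and a filter step that turns a predicate into a skipping decision over that metadata.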
25. Get Involved!
• We welcome contributions to Xskipper, e.g.
  • Support for new index types and UDFs – can be contributed as plugins
  • Support for additional engines beyond Spark
  • Integration with table formats
  • And more …
• Xskipper repo
• Extensible Data Skipping research paper
  • IEEE Big Data 2020 (runner-up best paper) – to appear
  • Arxiv version
• Data skipping blogs
  • Data Skipping for IBM Cloud SQL Query
  • Accelerate Your Big Data Analytics and Reduce Costs by Using IBM Cloud SQL Query
26. Thanks!
• Contact Info:
• Torsten Steinbach torsten@de.ibm.com
• Paula Ta-Shma paula@il.ibm.com
• Thanks to the team:
• Ofer Biran, Dat Bui, Linsong Chu, Patrick Dantressangle, Pranita Dewan,
Michael Factor, Oshrit Feder, Raghu Ganti, Erik Goepfert, Michael Haide,
Holly Hassenzahl, Pete Ihlenfeldt, Guy Khazma, Simon Laws, Gal Lushi, Yosef
Moatti, Jeremy Nachman, Daniel Pittner, Mudhakar Srinivasta
• The research leading to some of these results received funding from the European Community’s Horizon
2020 research and innovation program under grant agreement n° 779747
28. The COVID-19 Data Lake
Making trusted COVID-19 data collected by IBM TWC available to a broad set of analytics (e.g. https://accelerator.weather.com/bi)
• Easily extensible with new data sources
• Maximized velocity and elasticity
• Full automation of all pipelines
• New pipeline prototyped in hours & productized in 2-3 days
• Radically minimized resource and operational costs by using IBM Cloud serverless and full ops automation
[Diagram: Cloud Object Storage (persist, trigger), Cloud Functions (automation, monitoring & alerting, pull external data), Watson Studio (static content creation, schema management, pipeline PoCs, usage tutorials, pipeline productization), SQL Query (transformation, transport, table catalog (mart), queries, export)]
29. COVID-19 Data Lake Topology – High Level
[Diagram:
• Landing Zone (E) – landing namespace with landing buckets; TWC scrapers & pipeline, collector sequences; external data sources feeding in via push and pull
• Preparation Zone (T) – preparation namespace with preparation buckets; preparation sequences; schema management; static content management; location statistics upload; reference data updates
• Integration Zone (L) – integration namespace with integration buckets; mart sequences and delivery sequences; add partitions, query & extract, transform
• Serving – data mart instance (DWH) for dashboarding with Cognos; mart management project and data mart access project; table catalog; usage notebooks for users; pipeline PoC project with preliminary pipeline notebooks; pipeline instances]