IBM THINK 2019 - What? I Don't Need a Database to Do All That with SQL?

What? I Don't Need a Database to Do All That with SQL? (Session 2238)
Torsten Steinbach (Cloud Data & Analytics Architect)
Daniel Pittner (DevOps Architect)
Think 2019 / DOC ID / Feb, 2019 / © 2019 IBM Corporation

Evolution of Mobility
• Your own chauffeur-driven car
• Owning and driving a car
• Renting a car
• Car service
[Diagram axis: Flexibility]

Evolution of Form Factors for Big Data Analytics
The 1990s: Enterprise Data Warehouses
• Tightly integrated and optimized systems
The 2000s: Hadoop
• Introduced open data formats & easy scaling on commodity HW
Today: Cloud-Native: Serverless Analytics-aaS
• Seamless elasticity
• Pay-per-query consumption
• Analyze data as it sits in an object store
• Disaggregated architecture
• No more infrastructure headaches

SQL on Object Storage – Gartner Hype Cycle 2018

IBM SQL Query
[Diagram labels: Cloud Data, ETL, Serverless SQL, Analytics, Object Storage + Db2]
Audience: Developers, Data Engineers, Data Analysts
✓ Perfect for Machine Generated Data
✓ Ad-hoc Data Exploration
✓ Operationalizing Data Pipelines
✓ Big Data Lakes
✓ Flexible Data Transformation
✓ Extremely affordable. $5/TB scanned
✓ 100% API enabled
✓ BI on Object Storage
✓ Big Data Scale-Out. Running on Spark
✓ 100% Self service – No Setup
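The pay-per-scan pricing above lends itself to a quick back-of-the-envelope estimate. The following is a minimal sketch, assuming the $5/TB figure from the slide and decimal terabytes (10^12 bytes); the function name and the sample sizes are illustrative, not part of the service.

```python
def estimate_scan_cost_usd(bytes_scanned: int, usd_per_tb: float = 5.0) -> float:
    """Estimate the cost of one query from the number of bytes it scans."""
    bytes_per_tb = 10 ** 12  # assumption: the price refers to decimal terabytes
    return bytes_scanned / bytes_per_tb * usd_per_tb


# Hypothetical comparison: the same question answered by scanning
# 250 GB of raw data versus 20 GB of a more compact layout.
full_scan = estimate_scan_cost_usd(250 * 10 ** 9)
compact_scan = estimate_scan_cost_usd(20 * 10 ** 9)
print(f"full scan: ${full_scan:.2f}, compact scan: ${compact_scan:.2f}")
```

Because cost is proportional to bytes scanned, anything that reduces the scanned volume (columnar formats, partitioning, data skipping) reduces the bill directly.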

IBM Cloud SQL Query – Very High Level Architecture (MVP 1Q 2018)
1. Submit SQL (application to SQL Query)
2. Read data (data sets in IBM Cloud Object Storage)
3. Write results (result set in IBM Cloud Object Storage)
4. Read results (application)

IBM Cloud Query – Architecture
[Diagram labels: Upload, Land, Archive / Export, Query; IBM Cloud Streaming, IBM Streams, Message Hub, Watson IoT; IBM Cloud Databases, Db2 on Cloud; Geospatial SQL, Timeseries SQL, Data Skipping]

IBM Cloud Query – Access Patterns
SQL Query Usage: Create Query, Explore, Integrate, Deploy
• SQL Web Console
• Watson Studio Notebooks
• SQL REST API
• SQL Cloud Function
• Node SDK
• Python SDK
• JDBC

Can I do this with SQL? Yes, you can!
Analyze data without managing a database ✓
Complex timeseries calculations & analytics
Location data analytics
Transform, Reformat and Repartition data
Build data pipelines

The key to performance without a database:
Manage your data layout!

Proper data organization → better performance and lower cost
The key factors are:
• Number of bytes shipped
• Number of REST requests
Best practices for structured data:
• Choose the right object size (sweet spot: 128 MB)
• Choose the right format
• Choose the right data layout
• Avoid gzip compressed formats
Applies to SQL Query but also
applies to other Big Data engines
To learn more: https://www.ibm.com/blogs/bluemix/2018/06/big-data-layout/
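The two cost drivers named above (bytes shipped and number of REST requests) can be made concrete with a little arithmetic. This is a minimal sketch, assuming the 128 MB sweet spot means 128 MiB and that a full scan needs at least one GET request per object; both helper names are hypothetical.

```python
TARGET_OBJECT_BYTES = 128 * 1024 * 1024  # assumption: "128 MB" read as 128 MiB


def plan_object_count(total_bytes: int, target_bytes: int = TARGET_OBJECT_BYTES) -> int:
    """Number of objects needed so that each one is at most the target size."""
    if total_bytes <= 0:
        return 0
    return (total_bytes + target_bytes - 1) // target_bytes  # integer ceiling


def min_read_requests(total_bytes: int, object_bytes: int) -> int:
    """Lower bound on GET requests for a full scan: at least one per object."""
    return plan_object_count(total_bytes, object_bytes)


dataset = 10 * 1024 ** 3  # a hypothetical 10 GiB data set
print(min_read_requests(dataset, TARGET_OBJECT_BYTES))  # objects of 128 MiB
print(min_read_requests(dataset, 1024 * 1024))          # objects of 1 MiB
```

The same data split into 1 MiB objects needs over a hundred times more requests than at 128 MiB, which is why many tiny objects hurt both latency and request cost.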

Which Format is Query-Friendly?

Best Practice: minimize data scanned
1. Use Parquet
• Column based
• Only read the columns you need
• Column wise compression
• Min/max metadata
2. Use Hive style partitioning
• GPMeterStream/dt=2017-08-17/part-00085.csv
• Avoid reading unnecessary objects altogether
• Technique has limitations
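To illustrate why Hive-style partitioning lets an engine avoid reading unnecessary objects, here is a minimal sketch of partition pruning based purely on object names. It is a conceptual illustration, not the engine's actual implementation; the function names and the first object key are made up, the second key is the one from the slide.

```python
def partition_values(object_key: str) -> dict:
    """Extract name=value path segments from a Hive-style object name."""
    values = {}
    for segment in object_key.split("/")[:-1]:  # the last segment is the file name
        if "=" in segment:
            name, _, value = segment.partition("=")
            values[name] = value
    return values


def prune(object_keys, column, wanted):
    """Keep only objects whose partition value for `column` equals `wanted`."""
    return [key for key in object_keys if partition_values(key).get(column) == wanted]


keys = [
    "GPMeterStream/dt=2017-08-16/part-00001.csv",  # hypothetical neighbour partition
    "GPMeterStream/dt=2017-08-17/part-00085.csv",  # object name from the slide
]
print(prune(keys, "dt", "2017-08-17"))
```

Pruning happens before any data is read, so a filter on the partition column costs nothing for the skipped objects. The limitation is that it only helps for predicates on the partitioning columns.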

Serverless SQL Does Both:
1. Make Your Data Query Friendly
2. Analyze the Data

Table Partitioning Definition
SELECT … INTO
<Table Locator> [STORED AS CSV | PARQUET | JSON]
PARTITIONED [BY (<column list>)]
[INTO <num> BUCKETS]
[EVERY <num> ROWS]
BY: Produces Hive-style partitioning
INTO: Produces a fixed number of partitions (hash partitioned)
EVERY: Produces partitions of even size (e.g. for pagination)
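The three PARTITIONED variants can be rendered programmatically when SQL statements are generated by a pipeline. The sketch below assumes that exactly one variant is used per statement (the slide does not say whether they can be combined); the helper names and the column `dt` are hypothetical, and the table locator is taken from the examples in this deck.

```python
def partitioned_clause(by=None, buckets=None, every_rows=None) -> str:
    """Render one PARTITIONED variant: BY columns, INTO n BUCKETS, or EVERY n ROWS."""
    chosen = [opt for opt in (by, buckets, every_rows) if opt is not None]
    if len(chosen) != 1:
        # Assumption of this sketch: the variants are mutually exclusive.
        raise ValueError("choose exactly one of by, buckets, every_rows")
    if by is not None:
        return "PARTITIONED BY (" + ", ".join(by) + ")"
    if buckets is not None:
        return f"PARTITIONED INTO {int(buckets)} BUCKETS"
    return f"PARTITIONED EVERY {int(every_rows)} ROWS"


def into_clause(target_locator: str, stored_as: str, partitioning: str) -> str:
    """Assemble the INTO part of a SELECT ... INTO statement."""
    return f"INTO {target_locator} STORED AS {stored_as} {partitioning}"


print(into_clause("cos://us-geo/otherBucket/myData", "PARQUET",
                  partitioned_clause(by=["dt"])))
```

Use `by` for Hive-style layouts that later queries can prune, `buckets` for a fixed degree of parallelism, and `every_rows` when results are consumed page by page.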

Data Skipping Saving you Time and $
[Diagram: Data Skipping Indexing covers all data set objects in IBM Cloud Object Storage; the WHERE clause of a SQL query narrows them to candidate objects, saving time and $]
• SQL Query learns which objects are not relevant to a query using a data skipping index
• CREATE METAINDEX stores index summary metadata for each object. Much smaller than the data.
• Queries skip irrelevant objects, significantly reducing I/O
• Independent of data formats
• Index Types: Min/Max, Value List, Bounding Box
E.g.: Get location and time of heat waves (>40 Celsius)
SELECT lat, long, city, temp, date
FROM weather
WHERE temp > 40.0
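The idea behind a Min/Max data skipping index can be sketched in a few lines: keep a small summary per object and use it to decide which objects could possibly satisfy the WHERE clause. This is a conceptual illustration only, not how CREATE METAINDEX stores its metadata; the class, the object keys, and the temperature ranges are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ObjectSummary:
    """Per-object index entry: far smaller than the object it describes."""
    key: str
    temp_min: float
    temp_max: float


def candidates_for_temp_greater_than(index, threshold: float):
    """Objects that may contain rows with temp > threshold; all others are skipped."""
    return [summary.key for summary in index if summary.temp_max > threshold]


# Hypothetical index over three objects of a weather data set.
index = [
    ObjectSummary("weather/part-0000.parquet", temp_min=-5.0, temp_max=21.5),
    ObjectSummary("weather/part-0001.parquet", temp_min=18.0, temp_max=43.2),
    ObjectSummary("weather/part-0002.parquet", temp_min=2.0, temp_max=39.9),
]
print(candidates_for_temp_greater_than(index, 40.0))
```

For `WHERE temp > 40.0`, only objects whose maximum exceeds 40.0 need to be read. The index never causes wrong results: it may keep an object that turns out to have no matching rows, but it never drops one that has.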

Complex timeseries calculations & analytics
Transform, Reformat and Repartition data ✓

IBM Query – Timeseries SQL 1/2
• Intuitive first-of-a-kind SQL extensions for timeseries operations
• Industry leading differentiators, including:
  • Timeseries transformation functions: correlation, Fourier transformation, z-normalization, Granger, interpolation, and distances
  • Temporal joins: SQL support for left/right/full inner and outer joins of multiple timeseries
Alignment & Joining:
Apply for Beta Now
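As an illustration of one of the transformation functions named above, z-normalization rescales a series to zero mean and unit standard deviation so that series with different units or ranges become comparable. The sketch below is plain Python, not the Timeseries SQL syntax (which the slides do not show); it assumes the population standard deviation and maps a constant series to zeros.

```python
from statistics import mean, pstdev


def z_normalize(values):
    """Rescale a series to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption of this sketch: a constant series normalizes to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


print(z_normalize([1.0, 2.0, 3.0]))
```

After normalization, distance and correlation measures compare the shape of two series rather than their absolute levels.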

IBM Query – Timeseries SQL 2/2
• Further industry leading differentiators
  • Numerical and categorical timeseries types
  • Timeseries data skipping for fast queries
  • Forecasting: ARIMA, BATS, anomaly detection, etc.
  • Subsequence mining: train & match models for event sequences
  • Segmentation: time-based, record-based, anchor-based, burst, and silence
Segmentation:
Apply for Beta Now

Complex timeseries calculations & analytics ✓

IBM Query – Spatial SQL
• SQL/MM standard to store & analyze spatial data in RDBMS
• Migration of PostGIS compliant SQL queries
• Aggregation, computation and join via native SQL syntax
• Industry leading differentiators
  • Geodetic Full Earth support
  • Increased developer productivity
  • Avoid piece-wise planar projections
  • High precision calculations anywhere on the earth
  • Support for very large polygons (e.g. countries), polar caps, geometries crossing anti-meridian
  • Spatial data skipping for fast queries
  • Native and fine-granular geohash support
  • Fast spatial aggregation
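For readers unfamiliar with geohashes, mentioned in the list above: a geohash encodes a latitude/longitude pair as a short base-32 string by repeatedly halving the longitude and latitude ranges and interleaving the resulting bits. The sketch below implements the standard public geohash encoding in plain Python; it is not the service's own implementation, and the function name is illustrative.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    """Encode a coordinate as a geohash with `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


print(geohash_encode(57.64911, 10.40744, precision=5))
```

Because a longer geohash always starts with the shorter one for the same point, grouping rows by a geohash prefix is a cheap way to aggregate by grid cell at a chosen granularity.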

Location data analytics ✓

Sensor Data Analytics with Extended Syntax
[Diagram labels: Sensor Data (Mobile, Cars, Devices), Land, Query. Location Analytics (GPS, SQL/MM): Location Filtering, Spatial Aggregation. Timeseries SQL (Sensor Metrics): Timeseries Assembly, Timeseries Join.]

A Stack for Serverless Data & Analytics Solutions
• Serverless Storage: Object Storage
• Serverless Runtimes: Cloud Functions
• Serverless Analytics: Query

Use Cases of Cloud Functions Adding Value to SQL
Unstructured Data Prep
• Cloud Functions: Extract Features (COS to COS)
• SQL Query: Analyze
Automated/Scheduled SQL Execution
• Develop SQL
• Deploy as SQL Cloud Function
• Set up Cloud Function Trigger/Schedule
Shield Data From Direct Access
• Deploy Cloud Function with COS API Key
• Grant Execute on SQL Cloud Function to User
• User Calls Function to Access Data
Configure SQL Pipelines
• User creates function sequence to automate flow of consecutive SQLs

Use for Data Pipelines to fuel BI
[Diagram labels. Stages: Acquire, Land, Process, Explore, Analyze, Promote, Report. Sources: Applications, IoT, Streaming, Devices, Log Messages. Query operations: Cleanse, Filter, Merge, Aggregate, Compress. Targets: Data Warehouses & Databases, Db2 on Cloud. Tools: Watson Studio; BI Reporting with Looker, Cognos, Tableau.]

Location data analytics ✓
Build data pipelines ✓

When Serverless? When RDBMS?
RDBMS Serverless
Cloud-Native Solutions
Reserved Compute
Open Data Formats
Avoid data load
Schema at read
Interactive SQLs
Seamless elasticity
UDFs required
Transactions
Pay per query
JDBC/ODBC
REST API
Highly resilient/available

SQL Query @ IBM THINK 2019
11-Feb 10 AM: 2263 – The Future of SQL in IBM Cloud (Inner Circle)
12-Feb 9:30 AM: 2238 – What? I Don't Need a Database to Do All That with SQL?
13-Feb 10:30 AM: 2155 – Cloud-Native Clickstream Analysis in IBM Cloud
13-Feb 4:30 PM: 2282 – Enterprise-Scale Analytics Performance with Cloud Object Storage
14-Feb 2:30 PM: 2166 – Self-Service Cloud Data Management with SQL
15-Feb 8:30 AM: 2162 – A Sharing Economy for Analytics: SQL Query in IBM Cloud

Backup

IBM SQL Query – Available Features (Q1 2019)
Available Now:
• Read, write & transform open data in Object Storage
• CSV, JSON, Parquet, ORC, AVRO
• Full ANSI SQL & scale-out based on Apache Spark
• Including Authoritative Spark SQL Reference
• Geospatial SQL Support
• Automatic partitioning & schema inference
• Writing results w/ hive-style or paginated partitioning
• I/O Exploitation of Hive-style partitioning
• SQL Web UI
• SQL REST API
• Python & Node.js client SDKs
• IBM Cloud Function integration
• SQL Notebook in Watson Studio
Available for Beta By Invitation:
• Data Skipping Indexes
• Native Timeseries SQL Support
• JDBC Driver support
Upcoming:
• Reading from Cloudant
• Reading / Writing Db2 & other RDBMS
• Reading Shapefile data
• Cataloging SQL Assets

IBM Query REST API
Submit a SQL query
POST https://api.sql-query.cloud.ibm.com/v2/sql_jobs
Runs the SQL in the background and returns a job_id
Detailed info for a SQL query (e.g. status, result location)
GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs/{job_id}
Returns JSON with query execution details
List of recent SQL query executions
GET https://api.sql-query.cloud.ibm.com/v2/sql_jobs
Returns JSON array with last 30 SQL submissions and outcomes
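The submit and status calls can be prepared with nothing but the Python standard library. The sketch below only builds the request objects and does not send them. The endpoint paths come from the slide; the `instance_crn` query parameter, the `statement` body field, and the bearer-token header are assumptions that should be checked against the SQL Query API documentation. All function names and the sample values are hypothetical.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://api.sql-query.cloud.ibm.com/v2/sql_jobs"  # from the slide


def _headers(bearer_token: str) -> dict:
    # Assumption: the API expects an IAM bearer token in the Authorization header.
    return {
        "Authorization": f"Bearer {bearer_token}",
        "Content-Type": "application/json",
        "Accept": "application/json",
    }


def build_submit_request(statement: str, instance_crn: str, bearer_token: str):
    """Prepare POST /v2/sql_jobs; the response is expected to carry a job_id."""
    query = urllib.parse.urlencode({"instance_crn": instance_crn})  # assumed parameter
    body = json.dumps({"statement": statement}).encode("utf-8")     # assumed field name
    return urllib.request.Request(f"{BASE_URL}?{query}", data=body,
                                  headers=_headers(bearer_token), method="POST")


def build_status_request(job_id: str, instance_crn: str, bearer_token: str):
    """Prepare GET /v2/sql_jobs/{job_id} to fetch status and result location."""
    query = urllib.parse.urlencode({"instance_crn": instance_crn})
    job = urllib.parse.quote(job_id, safe="")
    return urllib.request.Request(f"{BASE_URL}/{job}?{query}",
                                  headers=_headers(bearer_token), method="GET")


req = build_submit_request(
    "SELECT * FROM cos://us-geo/otherBucket/myData LIMIT 10",
    instance_crn="crn:placeholder",   # hypothetical value
    bearer_token="token-placeholder",  # hypothetical value
)
print(req.get_method(), req.full_url)
```

Passing a prepared request to `urllib.request.urlopen` would execute it. Because the SQL runs in the background, a client would typically submit once and then poll the status request until the job finishes.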

Table Locators
cos://<endpoint>/<bucket>/[<prefix>] <format definition>
Endpoint – of your object storage bucket or a short alias
E.g. s3.us-south.cloud-object-storage.appdomain.cloud or us-south
Bucket – name in object storage
Prefix – one or multiple objects (e.g., table partitions) with same prefix
Used in FROM clauses for input data and in target field for result set data
Examples:
cos://us-south/myBucket/myFolder/mySubFolder/myData.parquet
cos://us-geo/otherBucket/myData
cos://us-geo/otherBucket/myData/part
cos://eu-geo/newBucket/
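The locator structure described above can be checked with a small parser. This is a minimal sketch that splits only the `cos://<endpoint>/<bucket>/[<prefix>]` part and does not handle a trailing format definition; the class and function names are hypothetical.

```python
from typing import NamedTuple


class TableLocator(NamedTuple):
    endpoint: str  # full endpoint host name or a short alias such as us-south
    bucket: str
    prefix: str    # may be empty, or match one or many objects


def parse_table_locator(locator: str) -> TableLocator:
    """Split a cos:// table locator into endpoint, bucket and prefix."""
    scheme = "cos://"
    if not locator.startswith(scheme):
        raise ValueError(f"not a COS table locator: {locator!r}")
    endpoint, _, remainder = locator[len(scheme):].partition("/")
    bucket, _, prefix = remainder.partition("/")
    if not endpoint or not bucket:
        raise ValueError(f"endpoint and bucket are required: {locator!r}")
    return TableLocator(endpoint, bucket, prefix)


for example in (
    "cos://us-south/myBucket/myFolder/mySubFolder/myData.parquet",
    "cos://us-geo/otherBucket/myData/part",
    "cos://eu-geo/newBucket/",
):
    print(parse_table_locator(example))
```

An empty prefix, as in the last example, addresses the bucket as a whole.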

<Table Locator> [STORED AS CSV | PARQUET | JSON]
• Specifies the data format of the input data
• Table schema is automatically inferred at SQL execution time
• Clause is optional, the default is CSV
• Additional parameters for CSV:
• E.g.: FIELDS TERMINATED BY ‘t’ NOHEADER
Table Format Definition

Spark SQL Reference
Use IBM SQL Query to learn Spark SQL
• SQL Query UI is basically an interactive Spark SQL UI
Best of breed Spark SQL Reference
• Complete, intuitive and interactive SQL Reference
• Each sample SQL can immediately be executed as is
https://cloud.ibm.com/docs/services/sql-query/sqlref/sql_reference.html#sql-reference

Further Resources
Getting started: https://www.ibm.com/cloud/sql-query
SQL Query Intro Video: https://youtu.be/s-FznfHJpoU
SQL Query Starter Notebook in Watson Studio: https://ibm.biz/BdYNrN
SQL Reference: https://ibm.biz/Bd2jF7
SQL Query API doc: https://cloud.ibm.com/apidocs/sql-query
Big Data Layout Best Practices for COS: https://ibm.biz/Bd2jRg
Serverless Data & Analytics: https://ibm.biz/Bd2jF5

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without
notice and at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it
should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal
obligation to deliver any material, code or functionality. Information about potential future products may not
be incorporated into any contract.
The development, release, and timing of any future features or functionality described for our products
remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled
environment. The actual throughput or performance that any user will experience will vary depending upon
many factors, including considerations such as the amount of multiprogramming in the user’s job stream,
the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can
be given that an individual user will achieve results similar to those stated here.
Please note

Notices and disclaimers
© 2018 International Business Machines Corporation. No part of this
document may be reproduced or transmitted in any form without
written permission from IBM.
U.S. Government Users Restricted Rights — use, duplication or
disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to
products that have not yet been announced by IBM) has been reviewed
for accuracy as of the date of initial publication and could include
unintentional technical or typographical errors. IBM shall have no
responsibility to update this information. This document is distributed
“as is” without any warranty, either express or implied. In no event,
shall IBM be liable for any damage arising from the use of this
information, including but not limited to, loss of data, business
interruption, loss of profit or loss of opportunity. IBM products and
services are warranted per the terms and conditions of the agreements
under which they are provided.
IBM products are manufactured from new parts or new and used parts.
In some cases, a product may not be new and may have been previously
installed. Regardless, our warranty terms apply.
Any statements regarding IBM's future direction, intent or product
plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in
controlled, isolated environments. Customer examples are presented as
illustrations of how those customers have used IBM products and the
results they may have achieved. Actual performance, cost, savings or
other results in other operating environments may vary.
References in this document to IBM products, programs, or services
do not imply that IBM intends to make such products, programs or
services available in all countries in which IBM operates or does
business.
Workshops, sessions and associated materials may have been prepared
by independent session speakers, and do not necessarily reflect the
views of IBM. All materials and discussions are provided for
informational purposes only, and are neither intended to, nor shall
constitute legal or other guidance or advice to any individual participant
or their specific situation.
It is the customer’s responsibility to ensure its own compliance
with legal requirements and to obtain advice of competent legal counsel
as to the identification and interpretation of any relevant laws and
regulatory requirements that may affect the customer’s business and
any actions the customer may need to take to comply with such
laws. IBM does not provide legal advice or represent or warrant that its
services or products will ensure that the customer follows any law.

Notices and disclaimers
continued
Information concerning non-IBM products was obtained from the
suppliers of those products, their published announcements or other
publicly available sources. IBM has not tested those products in connection with this
publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed
to the suppliers of those products. IBM does not warrant the quality of
any third-party products, or the ability of any such third-party products
to interoperate with IBM’s products. IBM expressly disclaims all
warranties, expressed or implied, including but not limited to, the
implied warranties of merchantability and fitness for a purpose.
The provision of the information contained herein is not intended to, and
does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
IBM, the IBM logo, ibm.com and [names of other referenced IBM
products and services used in the presentation] are trademarks of
International Business Machines Corporation, registered in many
jurisdictions worldwide. Other product and service names might
be trademarks of IBM or other companies. A current list of IBM
trademarks is available on the Web at “Copyright and trademark
information” at: www.ibm.com/legal/copytrade.shtml.

