Cloud Based Data Warehousing and Analytics

© 2015 IBM Corporation
Cloud Based Data Warehousing and Analytics: A Real Use Case
HHS-1807
Bogdan Sheptunov, Marriott
Bert Van der Linden, IBM
10/29/2015

Please Note:
• IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal
without notice at IBM’s sole discretion.
• Information regarding potential future products is intended to outline our general product direction
and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or
legal obligation to deliver any material, code or functionality. Information about potential future
products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our
products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a
controlled environment. The actual throughput or performance that any user will experience will vary
depending upon many factors, including considerations such as the amount of multiprogramming in the
user’s job stream, the I/O configuration, the storage configuration, and the workload processed.
Therefore, no assurance can be given that an individual user will achieve results similar to those stated
here.

MARRIOTT INTERNATIONAL CONFIDENTIAL & PROPRIETARY INFORMATION
Self Service Analytics In Cloud
October 5, 2015

Journey Towards Next Analytical Platform
• Starting data warehousing architecture and its limitations
• A decision to use BigSQL on cloud
• Constraints we operate within
• Original vision
• Interesting challenges along the way
• Where we are today
• Next Steps

Data Domains Spread Across Environments
[Diagram: data domains (Clickstream, Reservations, Customer, Loyalty, Marketing, B2B Sales, Call Center) sit in separate operational data environments alongside the data warehouse. Users either query a single dataset using SQL or pull data from multiple datasets to their PC.]

CPU, Space Limited In The Environment
Workloads: ETL, Ad Hoc Analytics, Reporting, In-database scoring
• Warehouses are CPU bound
• User base is global, hardware busy with update cycles
• Space limited as well
• Adding capacity requires large investment

Production environments governed by SDLC
• Deploying new data takes time, process
• Turning an idea into data requires a long process
[Diagram: a file becomes a table via Ticket, DBA, and Outage steps; an idea becomes a table via Project, Requirements, ETL developer, DBA, and Outage steps.]

Objective: Make Analysis More Efficient
• Create a hypothesis proving environment
• Leverage existing SQL skillset within the organization
• Add capacity in small increments
• Deliver a high performing system
• Add new technical capabilities in unstructured data and text analytics

Operating Model Changes
• A self service environment
• Not an operational environment; there’s no SLA on query responses
• Environment can lag behind the DW somewhat
• Still requires security measures
• Needs a new governance process
• Change the approach to data projects
• Prototype an idea with data before it’s productionalized

Vision for Analytical Workspace
[Diagram components: External Data Sources, DW, Landing Zone, Operational data marts, Reporting, Analytical Workspace, Data marts]

Going to Cloud
• Marriott thinks “Cloud First”
• Better cost control through managing capacity
• Speed to market
• A lot of data originates in cloud or is moving towards cloud
• Leveraging Marriott cloud on SoftLayer
• Interesting challenges
• Network bandwidth constrains data that does not originate in cloud
• Data needs to be appropriately secured
• Organization needs processes to function at cloud speeds

View With Tools
[Diagram: data moves from the Data Warehouses (DW) and External Data Sources, via the Landing Zone (LZ), into the Analytical Workspace, which contains a Relational Data Store and an HDFS Data Store. Four paths are shown: PUSH, VIEW, PULL, and LOAD. Tools and technologies shown: Netezza, DPS, Federate, Dataclick, Aginity, IBM Db2, Apache Parquet, and other ODBC connections, running on a workstation or on a server or workstation. Legend: Phase 1, Future Phase.]

Approach to Data Structures
- Initial focus on structured data from existing warehouses
- Source structured as stars, snowflakes
- Release 1 approach:
- Replicate existing structures as is
- Optimize physical data model for BigSQL use
- Create next generation of marts in future releases

Pushing Data to Cloud Required a Custom App
- Build vs Buy – developed an application from scratch
- Metadata-driven
- Compresses data in memory, delivers to cloud using SSH/SCP
- Applied to target table through ETL code
- Implemented Change Data Capture patterns
- Implemented to send minimal data over the network
- Timestamp-based
- XID-based
- Full comparison-based / relying on existing deltas
- Full replace
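
The change data capture patterns listed above can be illustrated with a small Python sketch. This is not the Marriott application: the row layout, the `updated_at` column, and all function names are assumptions made for illustration only. It shows a timestamp-based delta, a full-comparison delta based on row hashes, and in-memory compression of the resulting rows before they would be sent over the network.

```python
import csv
import gzip
import hashlib
import io
from datetime import datetime

def timestamp_delta(rows, last_sync, ts_column="updated_at"):
    """Timestamp-based CDC: keep rows modified after the last sync."""
    return [r for r in rows if r[ts_column] > last_sync]

def row_hash(row, columns):
    """Hash the listed columns of a row so two versions can be compared cheaply."""
    joined = "\x1f".join(str(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def comparison_delta(source_rows, target_hashes, key, columns):
    """Full-comparison CDC: keep rows that are new or whose hash differs."""
    return [r for r in source_rows
            if target_hashes.get(r[key]) != row_hash(r, columns)]

def compress_rows(rows, columns):
    """Serialize rows to CSV and gzip them in memory (no temporary file)."""
    text = io.StringIO()
    writer = csv.writer(text, lineterminator="\n")
    for r in rows:
        writer.writerow([r[c] for c in columns])
    return gzip.compress(text.getvalue().encode("utf-8"))

# Hypothetical sample data
COLUMNS = ["id", "name", "updated_at"]
source = [
    {"id": 1, "name": "A", "updated_at": datetime(2015, 10, 1)},
    {"id": 2, "name": "B", "updated_at": datetime(2015, 10, 4)},
    {"id": 3, "name": "C", "updated_at": datetime(2015, 10, 5)},
]

recent = timestamp_delta(source, datetime(2015, 10, 3))

# The target already holds ids 1 and 2, but its copy of row 2 is outdated
stale_row_2 = dict(source[1], name="old")
target_hashes = {
    1: row_hash(source[0], COLUMNS),
    2: row_hash(stale_row_2, COLUMNS),
}
changed = comparison_delta(source, target_hashes, "id", COLUMNS)
payload = compress_rows(changed, COLUMNS)
```

The timestamp variant only needs a reliable modification column on the source, while the comparison variant works without one at the cost of reading every source row.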

Data Publication Service

Unexpected Learnings While Pushing Data
- Horizontal (range) partitioning a key optimization technique
- Some of the source tables not conducive to partitioning as is
- Added ETL code to append natural key
- Using Hive for ETL
- BigSQL 3.0 good at querying, bottlenecks at writes
- Writing large number of rows best done in Hive
- HiveQL a new skillset for the organization
- Time spent on addressing data quality issues
- Line terminators (0x0A, 0x0C) replaced with blanks
- Backslash (\), various double quotes escaped with ""
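
The data quality fixes above amount to a per-field cleanup step before load. A minimal Python sketch follows. The function name is hypothetical, and the escaping convention (doubling backslashes and double quotes) is an assumption, since the slide does not spell out the exact rules.

```python
def clean_field(value):
    """Prepare one character field for a delimited load file."""
    # Replace the line terminators named on the slide (0x0A, 0x0C) with blanks
    value = value.replace("\x0a", " ").replace("\x0c", " ")
    # Assumed convention: escape a backslash by doubling it
    value = value.replace("\\", "\\\\")
    # Assumed convention: escape a double quote by doubling it
    value = value.replace('"', '""')
    return value

multiline = clean_field("first line\nsecond line")
quoted = clean_field('say "hi"')
path = clean_field("c:\\tmp")
```

Applying the cleanup on the sending side keeps each record on a single line, which matters for line-oriented file formats.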

Load: Uploading Data
- Business problem: upload dimension or a small fact from CSV or Excel
- BigInsights 3.0 user experience is disjointed
- Partnered with Aginity
- Workbench for Hadoop: free tool, available at
http://www.aginity.com/workbench/hadoop/
- Natively supports BigSQL

A User Friendly Wizard Starts with a File…

… And ends with a table.

Getting It Is A Bit Complicated
- No native way to bulk upload without opening lots of ports
- Uploaded into catalog DB2 instance with bulk APIs
[Diagram: Aginity uses the .NET Db2 Driver to send batched API calls to the BigSQL coordinator; batched inserts go into a row-based table in the Db2 SMP instance; a LOAD HADOOP command then moves the data into a Parquet table.]

Other Ways of Bringing Data In
- View: federation works
- Successfully federated Netezza to BigSQL
- Lag between source and target creates a governance challenge
- Not heavily exercised yet
- In cloud, difficult to quantify network usage
- Pull: DataClick implementation was postponed
- No way to do high performance uploads (ODBC driver)
- Users have to convert to Parquet, collect statistics themselves
- Impossible to trim data
- Use and administration is not trivial

Learnings From Porting Code
- Getting queries from Netezza was easy
- Ported queries, not stored procedures or views
- Followed typical best practices
- Horizontal partitioning most effective performance optimization technique
- Statistics
- Having BigSQL statistics on each column vital
- ANALYZE statement is expensive, breaks occasionally, and needs reruns
- Column group statistics help if there’s a significant skew

Sample Run Times
• Data volume: about 1 TB
• Run time ratio (BigSQL relative to Netezza) ranges from 0.5x to 3x

#  Query Name               Netezza (seconds)  BigSQL (seconds)
1  A1 AW UAT AO Easy        258                769
2  A2 AW UAT AO Medium 1    73                 168
3  A3 AW UAT AO Medium 2    156                204
4  A4 AW UAT AO Complex 1   282                436
5  A5 AW UAT AO Complex 2   529                363
6  D1 AW UAT OR Easy        716                431
7  D2 AW UAT OR Medium      254                377
8  D3 AW UAT OR Complex 1   1187               741
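
The ratio range can be checked against the listed run times. The short Python sketch below uses the numbers from the table and takes the ratio as BigSQL seconds divided by Netezza seconds, which is an assumption about how the slide computes it. The shortened query labels are used only as dictionary keys.

```python
# (Netezza seconds, BigSQL seconds) per query, copied from the table
run_times = {
    "A1": (258, 769),
    "A2": (73, 168),
    "A3": (156, 204),
    "A4": (282, 436),
    "A5": (529, 363),
    "D1": (716, 431),
    "D2": (254, 377),
    "D3": (1187, 741),
}

# Ratio above 1 means BigSQL was slower, below 1 means BigSQL was faster
ratios = {
    query: round(bigsql / netezza, 2)
    for query, (netezza, bigsql) in run_times.items()
}

lowest = min(ratios.values())
highest = max(ratios.values())
bigsql_faster = sorted(q for q, r in ratios.items() if r < 1)
```

On these eight queries the ratio spans roughly 0.6 to 3.0, and BigSQL is the faster engine on three of them.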

Binary Collation Is Standard in Hadoop Ecosystem
- “Hello” won’t match / join / sort with “Hello ” (trailing blank)
- User data, data warehouse contain trailing blanks at times
- Impossible to influence this behavior in BigInsights 3.0
- Options:
- Trim all character types during upload
- Trim at query time
- Appreciate the existing behavior :)
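
The effect of binary collation on joins, and the trimming option, can be sketched in Python. This only models byte-wise key comparison; it does not call BigSQL, and the function name and sample rows are hypothetical.

```python
def join_on_key(left, right, normalize=lambda s: s):
    """Inner join two lists of (key, value) pairs using byte-wise key equality."""
    index = {}
    for key, value in right:
        index.setdefault(normalize(key).encode("utf-8"), []).append(value)
    return [
        (key, left_value, right_value)
        for key, left_value in left
        for right_value in index.get(normalize(key).encode("utf-8"), [])
    ]

left = [("Hello", 1)]
right = [("Hello ", 2)]  # same value with a trailing blank

# Binary comparison: the trailing blank makes the keys different
binary_matches = join_on_key(left, right)

# Trimming trailing whitespace (at upload or at query time) restores the match
trimmed_matches = join_on_key(left, right, normalize=str.rstrip)
```

Trimming at upload pays the cost once, while trimming at query time has to be repeated in every query that touches the column.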

Other Notes
- BigSQL 3.0 suitable for writing out small tables, bottlenecks on large ones
- Ganglia irreplaceable for monitoring hardware
- Unable to push BigInsights metrics to Ganglia in 3.0
- SAS integration successful
- Great example of BigSQL thinly veiled as Db2

Current State
- In production
- Some datasets already enabled, more on the way
- Initial group of users started in the environment
- Improving monitoring, error reporting and recovery practices

Next Steps
- Expanding user base
- Design next generation of data marts
- Migrating to BigInsights 4.1
- Leveraging BigSQL for large writes
- Configuring High Availability
- Using built in text analytics functions on unstructured data
- Considering high speed file transfer software as transport layer
- Trimming character data during upload in Aginity

What I will be talking about…
• I’m going to touch on some aspects of BigSQL
• Data ingestion is always challenging, as we heard from Marriott
• Spark is the new buzzword. What does it mean for BigSQL?

Data Movement
• LOAD HADOOP
• DataStage
• Aspera
• Partners like Aginity
• BigSQL Federation

Big SQL LOAD command
• Where can the data come from?
  - Database via parallel JDBC
    - DB2, Netezza, Teradata, Oracle, SQL Server, MySQL, Postgres
    - Generic JDBC (Informix and IMS can use this)
  - CSV files on HDFS
  - SFTP
• Where can the data go to?
  - Any BigSQL/Hive table
• Special features:
  - User can control the parallelism
  - Rejected/bad rows are saved to a file
  - Control of how to manipulate input data (e.g. delimiters)
  - Control of how to write data (e.g. compression, reject nulls)

Big SQL LOAD example
• LOAD HADOOP
    USING FILE URL '/tmp/data/staff.csv'
    WITH SOURCE PROPERTIES ('field.delimiter'=':')
    INTO TABLE TEST.STAFF_F
    WITH LOAD PROPERTIES ('num.map.tasks' = 10)
    APPEND;
• LOAD HADOOP
    USING JDBC CONNECTION URL
      'jdbc:teradata://myhost/database=GOSALES'
    WITH PARAMETERS ('user'='myuser', password='mypass')
    FROM SQL QUERY
      'SELECT * FROM COUNTRY WHERE SALESCOUNTRYCODE > 6 AND $CONDITIONS'
    SPLIT COLUMN SALESCOUNTRYCODE
    INTO TABLE country_info
    APPEND;
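
The `$CONDITIONS` placeholder and the split column in the second example suggest how a parallel load can divide one query among several tasks, similar in spirit to the way Apache Sqoop uses a `$CONDITIONS` token with a split column. The Python sketch below illustrates that idea only. It is not Big SQL's implementation, the function name is hypothetical, and the split column bounds (7 and 18) are made-up values.

```python
def split_conditions(column, low, high, num_tasks):
    """Divide the inclusive integer range [low, high] into one predicate per task."""
    size = high - low + 1
    bounds = [low + (size * i) // num_tasks for i in range(num_tasks + 1)]
    return [
        f"{column} >= {start} AND {column} < {end}"
        for start, end in zip(bounds, bounds[1:])
        if start < end  # skip empty slices when there are more tasks than values
    ]

query = "SELECT * FROM COUNTRY WHERE SALESCOUNTRYCODE > 6 AND $CONDITIONS"

# 7 and 18 are hypothetical minimum and maximum values of the split column
predicates = split_conditions("SALESCOUNTRYCODE", 7, 18, 3)

# Each task runs the same query with its own range substituted for the placeholder
task_queries = [query.replace("$CONDITIONS", p) for p in predicates]
```

A split column with evenly distributed values gives each task a similar share of rows; a skewed column would leave some tasks with most of the work.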

Information Server - DataStage
• BigInsights was shipping with a DataClick teaser version
  That explains the lack of functionality and the lack of performance that Marriott encountered
  We don’t like teaser versions anymore…
  This is not representative of DataStage!
  Note that there are several names for the same technology and/or different configurations and deployments
• DataStage
• BigIntegrate & BigQuality
• InfoSphere Information Server
• DataWorks
• DataClick

IBM BigInsights BigIntegrate & BigQuality
Information Server on Hadoop
[Diagram: BigIntegrate (connect, transform, shape, deliver) and BigQuality (profile, classify, cleanse, monitor) run on the Hadoop Platform, made up of HDFS (redundant, reliable storage) and YARN (cluster resource management), with high speed extract / load, high speed native access, and high speed ingest. Data Integration, Quality and Governance Tooling serves Data Engineers, Data Analysts, and Developers.]

Big SQL Query federation
• Data never lives in isolation
  - Whether Hadoop serves as a landing zone or a queryable archive, it is desirable to query data across Hadoop and active data warehouses
• Big SQL provides the ability to query heterogeneous systems
  - Join Hadoop to other relational databases
  - Query optimizer understands capabilities of external system, including available statistics
  - As much work as possible is pushed to each system to process
[Diagram: a head node running Big SQL and four data nodes, each running Task Tracker, Data Node, and Big SQL.]

Big SQL Federation
What does it really look like?
• After some DDL statements that create a “nickname” for a remote table…
    create server my_db type teradata …
    create nickname T2(...) for server my_db
• This is what the SQL looks like:
    Select * from T1, T2
    where T1.id = T2.id
    and T2.price > 10.50
• Federation is totally invisible from SQL!

Federation supported data sources
• Teradata: V12, 13, 14
• Oracle: 11g, 11gR1, 11gR2, 12c
• Microsoft SQL Server: 2005, 2008, 2008R2, 2012
• DB2: 9.7, 9.8, 10.1, 10.5
• Netezza: 4.6, 5.0, 6.0, 7.2
• For more details:
  http://www-01.ibm.com/support/docview.wss?uid=swg27038537

Aginity, an IBM partner
• Chosen by Marriott

Aspera, an IBM company
• Claim to fame: very fast data transfer via compression and connection management
• Grew into a much broader offering
• Great for cloud environments

Aspera Product Portfolio
• Transfer clients: Web, Desktop, Email, Mobile, Embedded
• Transfer servers: Private On Premise, Public and Private Cloud, Hybrid
• Web applications: distribution, sharing, collaboration and exchange
• Management & automation: transfer management, monitoring and automation
• Synchronization: scalable, high-performance synchronization and replication
• FASP™ patented high-speed transport: Any Data Size, Any Distance, Any Network Conditions; Any Infrastructure: Block, Object, On Premise, Cloud


Questions from our customers
• What about Spark?
  - Spark is built for “analytics” and machine learning
  - But SQL is so great that everybody has to have SQL…
  - Using notebooks as the canvas
  - Using SQL to do certain steps that are easy in SQL
• When should I use Spark SQL?
  - It will be at your fingertips when you use Spark and its tooling
  - Very easy in the Java/Scala/Python environment of Spark
• When should I use BigSQL?
  - Obvious for SQL-centric applications
  - Very easy for remotely connecting via JDBC/ODBC
• Is Spark SQL fast?

Current State of the Art: Big SQL runs more SQL out-of-box
Porting Effort:
• Big SQL 4.1: 1 hour
• Spark SQL 1.5.0: 3-4 weeks
Big SQL is the only engine that can execute all 99 queries with minimal porting effort

… what happens when you scale it?
• 1 TB, single stream
  - Big SQL was faster on 76 / 99 queries
  - Big SQL averaged 5.5X faster
  - Removing top / bottom 5, Big SQL averaged 2.5X faster
• 1 TB, 4 concurrent streams
  - Spark SQL FAILED on 3 queries
  - Big SQL was 4.4X faster*
• 10 TB, single stream
  - Big SQL was faster on 80 / 99 queries
  - Spark SQL FAILED on 7 queries
  - Big SQL averaged 6.2X faster*
  - Removing top / bottom 5, Big SQL averaged 4.6X faster
• 10 TB, 4 concurrent streams
  - Big SQL elapsed time for workload was better than linear
  - Spark SQL could not complete the workload (numerous issues). Partial results possible with only 2 concurrent streams.
*Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)
Axes: More Users, More Data

Choose the Right Tool for the Right Job
• Big SQL: ideal tool for BI data analysts and production workloads
  - Migrating existing workloads to Hadoop
  - Security
  - Many concurrent users
  - Best in-class performance
• Spark SQL: ideal tool for data scientists and discovery
  - Machine learning
  - Transformation
  - Simpler SQL
  - Good performance
• Big SQL & Spark SQL co-exist in the cluster


Notices and Disclaimers
Copyright © 2015 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form
without written permission from IBM.
U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.
Information in these presentations (including information relating to products that have not yet been announced by IBM) has been reviewed for
accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to
update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO
EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO,
LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted
according to the terms and conditions of the agreements under which they are provided.
Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.
Performance data contained herein was generally obtained in controlled, isolated environments. Customer examples are presented as
illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other
results in other operating environments may vary.
References in this document to IBM products, programs, or services do not imply that IBM intends to make such products, programs or services
available in all countries in which IBM operates or does business.
Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the
views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or
other guidance or advice to any individual participant or their specific situation.
It is the customer’s responsibility to ensure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the
identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the
customer may need to take to comply with such laws. IBM does not provide legal advice or represent or warrant that its services or products will
ensure that the customer is in compliance with any law.

Notices and Disclaimers (cont’d)
Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly
available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance,
compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to
interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights,
trademarks or other intellectual property right.
• IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document
Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM
SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON,
OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®,
pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ,
Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of
International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be
trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at:
www.ibm.com/legal/copytrade.shtml.
