Cloud Based Data Warehousing and Analytics


Marriott use case of big data using BigInsights

Published in: Data & Analytics

  1. Cloud Based Data Warehousing and Analytics: A Real Use Case (HHS-1807). Bogdan Sheptunov, Marriott; Bert Van der Linden, IBM. 10/29/2015. © 2015 IBM Corporation
  2. Please Note:
     • IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
     • Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
     • The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
     • The development, release, and timing of any future features or functionality described for our products remains at our sole discretion. Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
  4. MARRIOTT INTERNATIONAL CONFIDENTIAL & PROPRIETARY INFORMATION. Journey Towards Next Analytical Platform
     • Starting data warehousing architecture and its limitations
     • A decision to use BigSQL on cloud
     • Constraints we operate within
     • Original vision
     • Interesting challenges along the way
     • Where we are today
     • Next Steps
  5. Data Domains Spread Across Environments
     [Diagram labels: Clickstream; Reservations; Customer Loyalty; Marketing; B2B Sales; Call Center; Operational Data; Data Warehouse; User’s PC; “Query single dataset using SQL”; “Pull data from multiple datasets”]
  6. CPU, Space Limited In The Environment
     • Workloads: ETL, Ad Hoc Analytics, Reporting, In-database scoring
     • Warehouses are CPU bound
     • User base is global, hardware busy with update cycles
     • Space limited as well
     • Adding capacity requires large investment
  7. Production environments governed by SDLC
     • Deploying new data takes time and process
     • Turning an idea into data requires a long process
     [Diagram labels: Table, File, Ticket, DBA, Outage, Table, Idea, Project, Requirements, Outage, ETL developer, DBA]
  8. Objective: Make Analysis More Efficient
     • Create a hypothesis proving environment
     • Leverage existing SQL skillset within the organization
     • Add capacity in small increments
     • Deliver a high performing system
     • Add new technical capabilities in unstructured data and text analytics
  9. Operating Model Changes
     • A self-service environment
     • Not an operational environment, there’s no SLA on query responses
     • Environment can lag behind the DW somewhat
     • Still requires security measures
     • Needs a new governance process
     • Change the approach to data projects
     • Prototype an idea with data before it is productionized
  10. Vision for Analytical Workspace
      [Diagram labels: External Data Sources; Operational Data; DW; Landing Zone; Operational data marts; Reporting; Analytical Workspace; Data marts]
  11. Going to Cloud
      • Marriott thinks “Cloud First”
      • Better cost control through managing capacity
      • Speed to market
      • A lot of data originates in cloud or is moving towards cloud
      • Leveraging Marriott cloud on SoftLayer
      • Interesting challenges
         - Network bandwidth constrains data that does not originate in cloud
         - Data needs to be appropriately secured
         - Organization needs processes to function at cloud speeds
  12. View With Tools
      [Diagram labels: Operational Data; Data Warehouses; Landing Zone (LZ); Analytical Workspace; External Data Sources; Other ODBC Connections; Relational Data Store; HDFS Data Store; DW; data paths PUSH, VIEW, PULL, LOAD; locations On Workstation, Server or Workstation; tools Netezza, DPS, Federate, Dataclick, Aginity, IBM Db2, Apache Parquet. Legend: Phase 1, Future Phase]
  13. Approach to Data Structures
      - Initial focus on structured data from existing warehouses
      - Source structured as stars, snowflakes
      - Release 1 approach:
         - Replicate existing structures as is
         - Optimize physical data model for BigSQL use
      - Create next generation of marts in future releases
  14. Pushing Data to Cloud Required a Custom App
      - Build vs Buy: developed an application from scratch
      - Metadata-driven
      - Compresses data in memory, delivers to cloud using SSH/SCP
      - Applied to target table through ETL code
      - Implemented Change Data Capture patterns
      - Implemented to send minimal data over the network
         - Timestamp-based
         - XID-based
         - Full comparison-based / relying on existing deltas
         - Full replace
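The slide does not show the application itself, so the following is only a minimal sketch of two of the ideas it names: timestamp-based change capture and in-memory compression before transfer. The row layout, the `updated_at` column, and both function names are hypothetical; the SSH/SCP delivery step is left out.

```python
import csv
import gzip
import io
from datetime import datetime


def select_changed_rows(rows, last_sync, ts_field="updated_at"):
    """Timestamp-based change capture: keep rows modified after the last sync."""
    return [row for row in rows if row[ts_field] > last_sync]


def compress_rows(rows, fieldnames):
    """Serialize rows to CSV and gzip the result entirely in memory."""
    text = io.StringIO()
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


# Hypothetical source table; the column names are assumptions for illustration.
source_rows = [
    {"id": 1, "name": "alpha", "updated_at": datetime(2015, 10, 1)},
    {"id": 2, "name": "beta", "updated_at": datetime(2015, 10, 20)},
    {"id": 3, "name": "gamma", "updated_at": datetime(2015, 10, 28)},
]
last_sync = datetime(2015, 10, 15)

delta = select_changed_rows(source_rows, last_sync)
payload = compress_rows(delta, ["id", "name", "updated_at"])
print(len(delta), "changed rows,", len(payload), "compressed bytes")
```

Only the delta crosses the network in this sketch, which is the point of the "send minimal data" bullet; the XID-based and full-comparison patterns would replace `select_changed_rows` with a different filter.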
  16. Unexpected Learnings While Pushing Data
      - Horizontal (range) partitioning a key optimization technique
         - Some of the source tables not conducive to partitioning as is
         - Added ETL code to append natural key
      - Using Hive for ETL
         - BigSQL 3.0 good at querying, bottlenecks at writes
         - Writing large number of rows best done in Hive
         - HiveQL a new skillset for the organization
      - Time spent on addressing data quality issues
         - Line terminators (0x0A, 0x0C) replaced with blanks
         - Backslash (\), various double quotes escaped with “”
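The data quality fixes listed above can be sketched as a small field-cleaning function. This is an illustration under stated assumptions, not the code Marriott ran: the function name is hypothetical, doubling is assumed as the escape for both double quotes and backslashes, and only the two byte values named on the slide are treated as line terminators.

```python
# The two byte values named on the slide (line feed and form feed).
LINE_TERMINATORS = ("\x0a", "\x0c")


def sanitize_field(value, terminators=LINE_TERMINATORS):
    """Make one text field safe to write into a delimited load file."""
    # Replace embedded line terminators with blanks so a row stays on one line.
    for ch in terminators:
        value = value.replace(ch, " ")
    # Assumption: a double quote is escaped by doubling it.
    value = value.replace('"', '""')
    # Assumption: a backslash is escaped by doubling it.
    value = value.replace("\\", "\\\\")
    return value


print(sanitize_field('Suite "A"\nsecond line'))
```

Passing a different `terminators` tuple (for example adding carriage return) adapts the sketch to other sources.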
  17. Load: Uploading Data
      - Business problem: upload dimension or a small fact from CSV or Excel
      - BigInsights 3.0 user experience is disjointed
      - Partnered with Aginity
      - Workbench for Hadoop: free tool, available at
      - Natively supports BigSQL
  20. Getting It Is A Bit Complicated
      - No native way to bulk upload without opening lots of ports
      - Uploaded into catalog DB2 instance with bulk APIs
      [Diagram labels: Aginity; .NET Db2 Driver; BigSQL coordinator; Db2 SMP instance; Parquet table; Batched API call; Batched inserts into row based; LOAD HADOOP command]
  21. Other Ways of Bringing Data In
      - View: federation works
         - Successfully federated Netezza to BigSQL
         - Lag between source and target creates a governance challenge
         - Did not heavily exercise yet
         - In cloud, difficult to quantify network usage
      - Pull: DataClick implementation was postponed
         - No way to do high performance uploads (ODBC driver)
         - Users have to convert to Parquet, collect statistics themselves
         - Impossible to trim data
         - Use and administration is not trivial
  22. Learnings From Porting Code
      - Getting queries from Netezza was easy
      - Ported queries, not stored procedures or views
      - Followed typical best practices
      - Horizontal partitioning most effective performance optimization technique
      - Statistics
         - Having BigSQL statistics on each column vital
         - ANALYZE statement expensive, breaks occasionally, reruns
         - Column group statistics help if there’s a significant skew
  23. Sample Run Times
      • Data volume: about 1 TB
      • Run times ratio ranges from 0.5 to 3x

      | #  | Query Name             | Netezza (seconds) | BigSQL (seconds) |
      |----|------------------------|-------------------|------------------|
      | 1  | A1 AW UAT AO Easy      | 258               | 769              |
      | 2  | A2 AW UAT AO Medium 1  | 73                | 168              |
      | 3  | A3 AW UAT AO Medium 2  | 156               | 204              |
      | 4  | A4 AW UAT AO Complex 1 | 282               | 436              |
      | 5  | A5 AW UAT AO Complex 2 | 529               | 363              |
      | 6  | D1 AW UAT OR Easy      | 716               | 431              |
      | 7  | D2 AW UAT OR Medium    | 254               | 377              |
      | 8  | D3 AW UAT OR Complex 1 | 1187              | 741              |
      | 9  | D4 AW UAT OR Complex 2 | 438               | 110              |
      | 10 | D5 AW UAT OR Complex 3 | 408               | 1085             |
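The run-time ratio quoted on the slide can be recomputed from the ten rows. The sketch below assumes the ratio means BigSQL seconds divided by Netezza seconds (the slide does not define it) and uses shortened query names. On these figures the ratio runs from about 0.25 (D4) to about 3.0 (A1), so the low end comes out somewhat below the 0.5 stated on the slide.

```python
# (query, Netezza seconds, BigSQL seconds), copied from the Sample Run Times slide.
run_times = [
    ("A1", 258, 769),
    ("A2", 73, 168),
    ("A3", 156, 204),
    ("A4", 282, 436),
    ("A5", 529, 363),
    ("D1", 716, 431),
    ("D2", 254, 377),
    ("D3", 1187, 741),
    ("D4", 438, 110),
    ("D5", 408, 1085),
]

# Assumed definition: ratio below 1.0 means BigSQL finished sooner than Netezza.
ratios = {name: bigsql / netezza for name, netezza, bigsql in run_times}

fastest = min(ratios, key=ratios.get)
slowest = max(ratios, key=ratios.get)
faster_on_bigsql = [name for name, ratio in ratios.items() if ratio < 1.0]

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
print("lowest:", fastest, "highest:", slowest, "BigSQL faster on:", faster_on_bigsql)
```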
  24. Binary Collation Is Standard in Hadoop Ecosystem
      - “Hello” won’t match / join / sort with “Hello ”
      - User data, data warehouse contain trailing blanks at times
      - Impossible to influence this behavior in BigInsights 3.0
      - Options:
         - Trim all character types during upload
         - Trim at query time
         - Appreciate the existing behavior :)
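The trailing-blank problem is easy to reproduce outside Hadoop. As an analogy only (Python string comparison stands in for binary collation here, and the record layout and function name are hypothetical), the sketch shows the mismatch and the "trim during upload" option:

```python
def trim_record(record):
    """Strip trailing blanks from every character field before upload."""
    return {
        key: value.rstrip(" ") if isinstance(value, str) else value
        for key, value in record.items()
    }


warehouse_row = {"greeting": "Hello ", "count": 3}  # trailing blank from the source
user_row = {"greeting": "Hello", "count": 3}

# Byte-for-byte comparison treats the two values as different.
binary_match = warehouse_row["greeting"] == user_row["greeting"]

# After trimming at upload time the join key lines up.
trimmed_match = trim_record(warehouse_row)["greeting"] == user_row["greeting"]

print(binary_match, trimmed_match)
```

Trimming at query time would apply the same `rstrip` idea in each comparison instead of once at load.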
  25. Other Notes
      - BigSQL 3.0 suitable for writing out small tables, bottlenecks on large ones
      - Ganglia irreplaceable for monitoring hardware
         - Unable to push BigInsights metrics to Ganglia in 3.0
      - SAS integration successful
         - Great example of BigSQL thinly veiled as Db2
  26. Current State
      - In production
      - Some datasets already enabled, more on the way
      - Initial group of users started in the environment
      - Improving monitoring, error reporting and recovery practices
  27. Next Steps
      - Expanding user base
      - Design next generation of data marts
      - Migrating to BigInsights 4.1
      - Leveraging BigSQL for large writes
      - Configuring High Availability
      - Using built-in text analytics functions on unstructured data
      - Considering high-speed file transfer software as transport layer
      - Trimming character data during upload in Aginity
  28. What I will be talking about…
      • I’m going to touch on some aspects of BigSQL
      • Data ingestion is always challenging, as we heard from Marriott
      • Spark is the new buzzword
         - What does it mean for BigSQL?
  29. Data Movement
      • LOAD HADOOP
      • DataStage
      • Aspera
      • Partners like Aginity
      • BigSQL Federation
  30. Big SQL LOAD command
      • Where can the data come from?
         - Database via parallel JDBC
            - DB2, Netezza, Teradata, Oracle, SQL Server, MySQL, Postgres
            - Generic JDBC (Informix and IMS can use this)
         - CSV files on HDFS, SFTP
      • Where can the data go to?
         - Any BigSQL/Hive table
      • Special features:
         - User can control the parallelism
         - Rejected/bad rows are saved to a file
         - Control of how to manipulate input data (e.g. delimiters)
         - Control of how to write data (e.g. compression, reject nulls)
  32. Information Server - DataStage
      • BigInsights was shipping with a DataClick teaser-version
         - That explains the lack of functionality and the lack of performance that Marriott encountered
         - We don’t like teaser versions anymore…
         - This is not representative of DataStage!
      • Note that there are several names for the same technology and/or different configurations and deployments
         - DataStage
         - BigIntegrate & BigQuality
         - InfoSphere Information Server
         - DataWorks
         - DataClick
  33. IBM BigInsights BigIntegrate & BigQuality: Information Server on Hadoop
      [Diagram labels: Data Integration, Quality and Governance Tooling (Data Engineers, Data Analyst, Developers); BigIntegrate (connect, transform, shape, deliver); BigQuality (profile, classify, cleanse, monitor); high speed native access; high speed ingest; high speed extract / load; Hadoop Platform: HDFS (redundant, reliable storage), YARN (cluster resource management)]
  34. Big SQL Query federation
      • Data never lives in isolation
         - Either as a landing zone or a queryable archive, it is desirable to query data across Hadoop and active data warehouses
      • Big SQL provides the ability to query heterogeneous systems
         - Join Hadoop to other relational databases
         - Query optimizer understands capabilities of external system
            - Including available statistics
         - As much work as possible is pushed to each system to process
      [Diagram labels: Head Node (Big SQL); four data nodes, each labeled Data Node, Task Tracker, Data Node, Big SQL]
  35. Big SQL Federation: What does it really look like?
      • After some DDL statements that create a “nickname” for a remote table…
         create server my_db type teradata …
         create nickname T2(...) for server my_db
      • This is what the SQL looks like:
         Select * from T1, T2 where = and T2.price > 10.50
      • Federation is totally invisible from SQL!!!
  36. Federation supported data sources
      • Teradata V12, 13, 14
      • Oracle 11g, 11gR1, 11gR2, 12c
      • Microsoft SQL Server 2005, 2008, 2008R2, 2012
      • DB2 9.7, 9.8, 10.1, 10.5
      • Netezza 4.6, 5.0, 6.0, 7.2
      • For more details:
  37. Aginity, an IBM partner
      • Chosen by Marriott
  38. Aspera, an IBM company
      • Claim to fame: Very fast data transfer via compression and connection management
      • Grew into a much broader offering
      • Great for cloud environments
  39. Aspera Product Portfolio
      • TRANSFER CLIENTS: Web, Desktop, Email, Mobile, Embedded
      • WEB APPLICATIONS: Distribution, sharing, collaboration and exchange
      • MANAGEMENT & AUTOMATION: Transfer management, monitoring and automation
      • SYNCHRONIZATION: Scalable, high-performance synchronization and replication
      • FASP™ PATENTED HIGH-SPEED TRANSPORT: Any Data Size, Any Distance, Any Network Conditions
      • TRANSFER SERVERS: Any Infrastructure: Block, Object, On Premise, Cloud
      • Deployment: Private On Premise, Public and Private Cloud, Hybrid
  40. Spark
  41. Questions from our customers
      • What about Spark?
      • When should I use Spark SQL?
      • When should I use BigSQL?
      • Is Spark SQL fast?
  42. Questions from our customers
      • What about Spark?
         - Spark is built for “analytics”, machine learning
         - But SQL is so great that everybody has to have SQL…
         - Using notebooks as the canvas
         - Using SQL to do certain steps that are easy in SQL
      • When should I use Spark SQL?
         - It will be at your fingertips when you use Spark and its tooling
         - Very easy in the Java/Scala/Python environment of Spark
      • When should I use BigSQL?
         - Obvious for SQL-centric applications
         - Very easy for remotely connecting via JDBC/ODBC
      • Is Spark SQL fast?
  43. Current State of the Art: Big SQL runs more SQL out-of-box
      • Porting effort: Big SQL 4.1, 1 hour; Spark SQL 1.5.0, 3-4 weeks
      • Big SQL is the only engine that can execute all 99 queries with minimal porting effort
  44. … what happens when you scale it? (axes: More Users, More Data)
      • 1 TB, Single Stream
         - Big SQL was faster on 76 / 99 queries
         - Big SQL averaged 5.5X faster
         - Removing Top / Bottom 5, Big SQL averaged 2.5X faster
      • 1 TB, 4 Concurrent Streams
         - Spark SQL FAILED on 3 queries
         - Big SQL was 4.4X faster*
      • 10 TB, Single Stream
         - Big SQL was faster on 80 / 99 queries
         - Spark SQL FAILED on 7 queries
         - Big SQL averaged 6.2X faster*
         - Removing Top / Bottom 5, Big SQL averaged 4.6X faster
      • 10 TB, 4 Concurrent Streams
         - Big SQL elapsed time for workload was better than linear
         - Spark SQL could not complete the workload (numerous issues). Partial results possible with only 2 concurrent streams.
      *Compares only queries that both Big SQL and Spark SQL could complete (benefits Spark SQL)
  45. Choose the Right Tool for the Right Job
      • Big SQL: Migrating existing workloads to Hadoop; Security; Many Concurrent Users; Best in-class Performance. Ideal tool for BI Data Analysts and production workloads
      • Spark SQL: Machine Learning; transformation; Simpler SQL; Good Performance. Ideal tool for Data Scientists and discovery
      • Big SQL & Spark SQL co-exist in the cluster
  49. Thank You